Abstract

We consider the problem of supervised learning with convex loss functions and propose a new form 
of iterative regularization based on the subgradient method. Unlike other regularization approaches, 
in iterative regularization no constraint or penalization is considered, and generalization is achieved 
by (early) stopping an empirical iteration. We consider a nonparametric setting, in the framework of 
reproducing kernel Hilbert spaces, and prove finite sample bounds on the excess risk under general 
regularity conditions. Our study provides a new class of efficient regularized learning algorithms and 
gives insights on the interplay between statistics and optimization in machine learning. 

1 Introduction 

The availability of large high-dimensional data sets has motivated an interest in the interplay between
statistics and optimization, towards developing new, more efficient learning solutions [8]. Indeed, while much
theoretical work has classically been devoted to studying the statistical properties of estimators defined by
variational schemes (e.g. Empirical Risk Minimization [36] or Tikhonov regularization [35]), and to the
computational properties of optimization procedures to solve the corresponding minimization problems
(see e.g. [32]), much less work has considered the integration of statistical and optimization aspects, see
for example [15, 39, 25].

With the latter objective in mind, in this paper we focus on so-called iterative regularization. This
class of methods, which originated in a series of works in the mid-eighties [23, 26], is based on the observation that
early termination of an iterative optimization scheme applied to an ill-posed problem has a regularization
effect. A critical implication of this fact is that the number of iterations serves as a regularization
parameter, hence linking modeling and computational aspects: computational resources are directly
linked to the generalization properties in the data, rather than to their raw amount. Further, iterative
regularization algorithms have a built-in "warm restart" property which allows one to automatically compute
a whole sequence (path) of solutions corresponding to different levels of regularization. This latter
property is especially relevant for efficiently determining the appropriate regularization via model selection.

Iterative regularization techniques are well known in the solution of inverse problems, where several
variants have been studied, see [18, 21] and references therein. In machine learning, iterative regularization
is often simply referred to as early stopping and is a well-known "trick", e.g. in training neural networks
[22]. Theoretical studies of iterative regularization in machine learning have mostly focused on the least
squares loss function [11, 41, 7, 27]. Indeed, it is in this latter case that the connection to inverse problems
can be made precise [38]. Interestingly, early stopping with the square loss has been shown to be related
to boosting [11] and also to be a special case of a large class of regularization approaches based on spectral
filtering [19, 4]. The regularizing effect of early stopping for loss functions other than the least squares
one has hardly been studied. Indeed, to the best of our knowledge the only papers considering related
ideas are [3, 6, 20, 45], where early stopping is studied in the context of boosting algorithms.

This paper is a different step towards understanding how early stopping can be employed with general
convex loss functions. Within a statistical learning setting, we consider convex loss functions and
propose a new form of iterative regularization based on the subgradient method, or on gradient descent
if the loss is smooth. The resulting algorithms provide iterative regularization alternatives to support
vector machines or regularized logistic regression, and have the built-in property of computing the whole
regularization path. Our primary contribution in this paper is theoretical. By integrating optimization
and statistical results, we establish non-asymptotic bounds quantifying the generalization properties of
the proposed method under standard regularity assumptions. Interestingly, our study shows that considering
the last iterate leads to essentially the same results as considering averaging, or selecting the
"best" iterate, as typically done in subgradient methods [9]. From a technical point of view, considering
a general convex loss requires different error decompositions than those for the square loss. Moreover,
operator theoretic techniques need to be replaced by convex analysis and empirical process results. The
error decomposition we consider accounts for the contribution of both optimization and statistics to the
error, and could also be useful for other methods.

The rest of the paper is organized as follows. We begin in Section 2 by briefly recalling the
supervised learning problem, and then introduce our learning algorithm and discuss its numerical realization.
In Section 3, after discussing the assumptions that underlie our analysis, we present and discuss our main
theorems, and then describe the general error decomposition, which is composed of three error terms: the
computational, the sample and the approximation error terms. In Section 4, we estimate the computational
error, while in Section 5 we develop sample error bounds and finally prove our main results.

2 Learning Algorithm 

After briefly recalling the supervised learning problem, we introduce the algorithm we propose and give 
some comments on its numerical realization. 

2.1 Problem Statement 

In this paper we consider the problem of supervised learning. Let $X$ be a separable metric space, $Y \subseteq \mathbb{R}$
and let $\rho$ be a Borel probability measure on $Z = X \times Y$. Moreover, let $V : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ be a so-called loss
function, measuring the local error $V(y, f(x))$ for $(x, y) \in Z$ and $f : X \to \mathbb{R}$. The generalization error
(or expected risk) $\mathcal{E} = \mathcal{E}^V$ associated to $V$ is given by

\[ \mathcal{E}(f) = \int_Z V(y, f(x))\, d\rho, \]

and is well defined for any measurable loss function $V$ and measurable function $f$. We assume throughout
that there exists a function $f_\rho^V$ that minimizes the expected error $\mathcal{E}(f)$ among all measurable functions
$f : X \to Y$. Roughly speaking, the goal of learning is to find an approximation of $f_\rho^V$ when the measure $\rho$
is known only through a sample $\mathbf{z} = \{z_i = (x_i, y_i)\}_{i=1}^m$ of size $m \in \mathbb{N}$ independently and identically drawn
according to $\rho$. More precisely, given $\mathbf{z}$ the goal is to design a computational procedure to efficiently
estimate a function $f_{\mathbf{z}}$, an estimator, for which it is possible to derive an explicit probabilistic bound on
the excess expected risk

\[ \mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}(f_\rho^V). \]

We end this section with a remark and an example.

Remark 2.1. For several loss functions, it is possible to show that $f_\rho^V$ exists, see the example below.
However, as will be seen in the following, the search for an estimator in practice is often restricted to some
hypothesis space $\mathcal{H}$ of measurable functions. In this case one should replace $\mathcal{E}(f_\rho^V)$ by $\inf_{f \in \mathcal{H}} \mathcal{E}(f)$.
Interestingly, examples of hypothesis spaces are known for which $\mathcal{E}(f_\rho^V) = \inf_{f \in \mathcal{H}} \mathcal{E}(f)$, namely universal
hypothesis spaces [34]. In the following, we consider $\mathcal{E}(f_\rho^V)$, with the understanding that it should be
replaced by the infimum over $\mathcal{H}$ if the latter is not universal.

The following example gives several possible choices of loss functions. 

Example 2.1. The most classical example of loss function is probably the square loss $V(y,a) = (y - a)^2$,
$y, a \in \mathbb{R}$. In this case, $f_\rho^V$ is the regression function, defined at every point as the expectation
of the conditional distribution of $y$ given $x$ [17, 34]. Further examples include the absolute value loss
$V(y,a) = |y - a|$, for which $f_\rho^V$ is the median of the conditional distribution, and more generally $p$-loss
functions $V(y,a) = |y - a|^p$, $p \in \mathbb{N}$. Vapnik's $\epsilon$-insensitive loss $V(y,a) = \max\{|y - a| - \epsilon, 0\}$, $\epsilon > 0$,
and its generalizations $V(y,a) = \max\{|y - a|^p - \epsilon, 0\}$, $\epsilon > 0$, $p > 1$, provide yet other examples. For
classification, i.e. $Y = \{\pm 1\}$, other examples of loss functions include the hinge loss
$V(y,a) = \max\{1 - ya, 0\}$, the logistic loss $V(y,a) = \log(1 + e^{-ya})$ and the exponential loss $V(y,a) = e^{-ya}$.
For all these examples $f_\rho^V$ can be computed, see e.g. [34], and measurability is easy to check.
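
As an illustration of the losses above, the following Python sketch implements a few of them together with their (left) derivatives with respect to the second argument, which is the quantity the algorithm of the next section evaluates. The function names are ours and purely illustrative; only the formulas are taken from the example. The hinge, absolute value and logistic losses have bounded derivatives, while the derivative of the square loss grows linearly in its second argument.

```python
import math

def hinge(y, a):
    """Hinge loss max{1 - y*a, 0}, for labels y in {-1, +1}."""
    return max(1.0 - y * a, 0.0)

def hinge_left_derivative(y, a):
    """Left derivative in a of the hinge loss (the kink is at y*a = 1)."""
    if y > 0:
        return -1.0 if a <= 1.0 else 0.0   # a -> max{1 - a, 0}
    return 1.0 if a > -1.0 else 0.0        # a -> max{1 + a, 0}

def absolute_left_derivative(y, a):
    """Left derivative in a of |y - a| (equal to -1 at the kink a = y)."""
    return -1.0 if a <= y else 1.0

def square_derivative(y, a):
    """Derivative in a of (y - a)^2; the loss is smooth."""
    return 2.0 * (a - y)

def logistic(y, a):
    """Logistic loss log(1 + exp(-y*a)), evaluated in a numerically stable way."""
    z = -y * a
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def logistic_derivative(y, a):
    """Derivative in a of the logistic loss, -y / (1 + exp(y*a))."""
    z = y * a
    if z >= 0:
        e = math.exp(-z)
        return -y * e / (1.0 + e)
    return -y / (1.0 + math.exp(z))
```

For a smooth loss the left derivative coincides with the ordinary derivative, so the same interface covers both the subgradient and the gradient descent variants.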

2.2 Learning via Subgradient Methods with Early Stopping 

To present the proposed learning algorithm we need a few preliminary definitions. Consider a reproducing
kernel $K : X \times X \to \mathbb{R}$, that is a symmetric function such that the matrix $(K(u_i, u_j))_{i,j=1}^{\ell}$ is positive
semidefinite for any finite set of points $\{u_i\}_{i=1}^{\ell}$ in $X$. Recall that a reproducing kernel $K$ defines a
reproducing kernel Hilbert space (RKHS) $(\mathcal{H}_K, \|\cdot\|_K)$ as the completion of the linear span of the set
$\{K_x(\cdot) := K(x, \cdot) : x \in X\}$ with respect to the inner product $\langle K_x, K_u \rangle_K := K(x, u)$ [2]. Moreover, assume
the loss function $V$ to be measurable and convex in its second argument, so that the corresponding left
derivative $V'_-$ exists and is non-decreasing at every point. For a step size sequence $\{\eta_t > 0\}_t$, a stopping
iteration $T > 2$ and an initial value $f_1 = 0$, we consider the iteration

\[ f_{t+1} = f_t - \eta_t \frac{1}{m} \sum_{j=1}^{m} V'_-(y_j, f_t(x_j)) K_{x_j}, \qquad t = 1, \ldots, T. \tag{2.1} \]

The above iteration corresponds to the subgradient method [5, 10] for minimizing the empirical error
$\mathcal{E}_{\mathbf{z}} = \mathcal{E}_{\mathbf{z}}^V$ with respect to the loss $V$, which is given by

\[ \mathcal{E}_{\mathbf{z}}(f) = \frac{1}{m} \sum_{j=1}^{m} V(y_j, f(x_j)). \]

Indeed, it is easy to see that $\frac{1}{m} \sum_{j=1}^{m} V'_-(y_j, f(x_j)) K_{x_j} \in \partial \mathcal{E}_{\mathbf{z}}(f)$, the subdifferential of the empirical risk, for
$f \in \mathcal{H}_K$. In the special case where the loss function is smooth, (2.1) reduces to the gradient descent
algorithm. Since the subgradient method is not a descent algorithm, rather than the last iterate, the so-called
Cesaro mean is often considered, corresponding, for $T \in \mathbb{N}$, to the following weighted average

\[ a_T = \sum_{t=1}^{T} \omega_t f_t, \qquad \omega_t = \frac{\eta_t}{\sum_{k=1}^{T} \eta_k}, \qquad t = 1, \ldots, T. \tag{2.2} \]

Alternatively, the best iterate is also often considered, which is defined for $T \in \mathbb{N}$ by

\[ b_T = \operatorname*{arg\,min}_{f_t,\ t = 1, \ldots, T} \mathcal{E}_{\mathbf{z}}(f_t). \tag{2.3} \]

In what follows, we will consider the learning algorithms obtained considering these different choices.

We note that classical results [5, 10, 9] on the subgradient method focus on how the iteration (2.1)
can be used to minimize $\mathcal{E}_{\mathbf{z}}$. In contrast to these studies, in the following we are interested in showing how
iteration (2.1) can be used to define a statistical estimator, hence a learning algorithm to minimize the
expected risk $\mathcal{E}$, rather than the empirical risk $\mathcal{E}_{\mathbf{z}}$. We end with one remark.

Remark 2.2 (Early Stopping SVM and Kernel Perceptron). If we consider the hinge loss function 
in (2.1), the corresponding algorithm is closely related to a batch (kernel) version of the perceptron [29, 1], 
where an entire pass over the data is done before updating the solution. Such an algorithm can also be 
seen as an early stopping version of Support Vector Machines [16]. Interestingly, in this case the whole 
regularization path is computed incrementally albeit sparsity could be lost. We defer to a future work the 
study of the practical implications of these observations. 


2.3 Numerical Realization 


The simplest case to derive a numerical procedure from Algorithm 2.1 is when $X = \mathbb{R}^d$ for some $d \in \mathbb{N}$
and $K$ is the associated inner product. In this case it is straightforward to see that $f_{t+1}(x) = w_{t+1}^\top x$ for
all $x \in X$, with

\[ w_{t+1} = w_t - \eta_t \frac{1}{m} \sum_{j=1}^{m} V'_-(y_j, w_t^\top x_j)\, x_j, \qquad t = 1, \ldots, T, \]

and $w_1 = 0$.


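Beyond the linear kernel, it can be easily seen that given a finite dictionary

\[ \{\phi_i : X \to \mathbb{R},\ i = 1, \ldots, p\}, \qquad p \in \mathbb{N}, \]

one can consider the kernel $K(x, x') = \sum_{i=1}^{p} \phi_i(x) \phi_i(x')$. In this case, it holds $f_{t+1}(x) = \sum_{i=1}^{p} w_{t+1}^i \phi_i(x) =
w_{t+1}^\top \Phi(x)$, $\Phi(x) = (\phi_1(x), \ldots, \phi_p(x))^\top$ for all $x \in X$, with

\[ w_{t+1} = w_t - \eta_t \frac{1}{m} \sum_{j=1}^{m} V'_-(y_j, w_t^\top \Phi(x_j))\, \Phi(x_j), \qquad t = 1, \ldots, T, \]

and $w_1 = 0$. Finally, for a general kernel it is easy to prove by induction that $f_{t+1}(x) = \sum_{j=1}^{m} c_{t+1}^j K(x, x_j)$
for all $x \in X$, with

\[ c_{t+1} = c_t - \eta_t \frac{1}{m} g_t, \qquad t = 1, \ldots, T, \]

for $c_1 = 0$ and $g_t \in \mathbb{R}^m$ with $g_t^j = V'_-(y_j, f_t(x_j))$. Indeed, the base case is straightforward to
check and moreover by the inductive hypothesis

\[ f_{t+1} = \sum_{j=1}^{m} c_t^j K_{x_j} - \eta_t \frac{1}{m} \sum_{j=1}^{m} V'_-(y_j, f_t(x_j)) K_{x_j} = \sum_{j=1}^{m} \Big( c_t^j - \eta_t \frac{1}{m} V'_-(y_j, f_t(x_j)) \Big) K_{x_j}. \]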
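
The coefficient recursion for a general kernel can be sketched in a few lines of Python. The sketch below is only an illustration under our own assumptions: a Gaussian kernel, the hinge loss, the step size $\eta_t = \eta_1 t^{-\theta}$ used later in the paper, and hypothetical function names and toy data. Since every iterate is a linear combination of the functions $K_{x_j}$, the weighted average (2.2) and the best iterate (2.3) can be computed directly on the coefficient vectors.

```python
import math

def gaussian_kernel(x, u, width=1.0):
    """Gaussian kernel on lists of floats (an illustrative choice of K)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, u))
    return math.exp(-sq / (2.0 * width ** 2))

def hinge(y, a):
    return max(1.0 - y * a, 0.0)

def hinge_left_derivative(y, a):
    """Left derivative in a of max{1 - y*a, 0}, for y in {-1, +1}."""
    if y > 0:
        return -1.0 if a <= 1.0 else 0.0
    return 1.0 if a > -1.0 else 0.0

def gram_matrix(X, kernel):
    return [[kernel(xi, xj) for xj in X] for xi in X]

def predict_train(G, c):
    """Values f(x_i) = sum_j c^j K(x_i, x_j) on the training inputs."""
    return [sum(gij * cj for gij, cj in zip(row, c)) for row in G]

def empirical_risk(G, y, c, loss):
    preds = predict_train(G, c)
    return sum(loss(yi, pi) for yi, pi in zip(y, preds)) / len(y)

def subgradient_path(G, y, left_derivative, T, eta1=0.25, theta=0.5):
    """Return the coefficients [c_1, ..., c_{T+1}] and the steps [eta_1, ..., eta_T]."""
    m = len(y)
    c = [0.0] * m                      # c_1 = 0, i.e. f_1 = 0
    path, etas = [list(c)], []
    for t in range(1, T + 1):
        eta_t = eta1 * t ** (-theta)   # eta_t = eta_1 * t^(-theta)
        preds = predict_train(G, c)    # f_t(x_j), j = 1, ..., m
        g = [left_derivative(yj, pj) for yj, pj in zip(y, preds)]
        c = [cj - eta_t * gj / m for cj, gj in zip(c, g)]
        path.append(list(c))
        etas.append(eta_t)
    return path, etas

def weighted_average(path, etas):
    """Coefficients of a_T in (2.2): sum over t = 1..T of omega_t * c_t."""
    total = sum(etas)
    m = len(path[0])
    return [sum(eta * c[j] for eta, c in zip(etas, path)) / total for j in range(m)]

def best_iterate(G, y, path, T, loss):
    """Coefficients of b_T in (2.3): smallest empirical risk among f_1, ..., f_T."""
    return min(path[:T], key=lambda c: empirical_risk(G, y, c, loss))

# Toy one-dimensional data (hypothetical), two points per class.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [-1, -1, 1, 1]
G = gram_matrix(X, gaussian_kernel)
path, etas = subgradient_path(G, y, hinge_left_derivative, T=50)
```

Each iteration costs one matrix-vector product with the Gram matrix, and the list `path` is exactly the regularization path: stopping earlier simply means reading off an earlier entry.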

3 Main Results with Discussions 

After presenting our main assumptions, in this section we state and discuss our main results. 

3.1 Assumptions 

Our learning rates will be stated under several conditions on the triple {p,V,K), that we describe and 
comment next. We begin with a basic assumption. 

Assumption 3.1. We assume the kernel to be bounded, that is $\kappa = \sup_{x \in X} \sqrt{K(x, x)} < \infty$, and moreover
$\|f_\rho^V\|_\infty < \infty$ and $|V|_0 := \sup_{y \in Y} V(y, 0) < \infty$. Furthermore, we consider the following growth condition
for the left derivative $V'_-(y, \cdot)$. For some $q \ge 0$ and constant $c_q > 0$, it holds,

\[ |V'_-(y, a)| \le c_q (1 + |a|^q), \qquad \forall a \in \mathbb{R},\ y \in Y. \tag{3.1} \]

The boundedness conditions on $K$, $f_\rho^V$ and $V$ are fairly common [17, 34]. They could probably be weakened
by considering a more involved analysis which is outside the scope of this paper. Interestingly, the growth
condition on the left derivative of $V$ is weaker than assuming the loss, or its gradient, to be Lipschitz in
its second entry, as often done both in learning theory [17, 34] and in optimization [9]. We note that the
growth condition (3.1) is implied by the requirement for the loss function to be Nemitski, as introduced
in [37] (see also [34]). This latter condition, which is satisfied by most loss functions, is natural to provide
a variational characterization of the learning problem.

The second assumption refines the above boundedness condition by considering a variance-expectation
bound which quantifies a notion of noise in the measure $\rho$ with respect to the balls $B_R = \{f \in \mathcal{H}_K : \|f\|_K \le R\}$
in $\mathcal{H}_K$ [17, 34].

Assumption 3.2. We assume that there exist an exponent $\tau \in [0,1]$ and a positive constant $c_\tau$ such that
for any $R \ge 1$ and $f \in B_R$, we have

\[ \int_Z \big\{ V(y, f(x)) - V(y, f_\rho^V(x)) \big\}^2 d\rho \le c_\tau R^{2+q-\tau} \big\{ \mathcal{E}(f) - \mathcal{E}(f_\rho^V) \big\}^{\tau}. \tag{3.2} \]

Assumption 3.2 always holds true for $\tau = 0$, in which case $c_\tau$ will also depend on $\|f_\rho^V\|_\infty$. In classification,
the above condition can be related to the so-called Tsybakov margin condition. The latter quantifies the
intuition that a classification problem is hard if the conditional probability of $y$ given $x$ is close to $1/2$ for
many input points. More precisely, if we denote by $\rho(y|x)$ the conditional probability for all $(x, y) \in Z$
and by $\rho_X$ the marginal probability on $X$, then we say that $\rho$ satisfies the Tsybakov margin condition
with exponent $s$ if there exists a constant $C > 0$ such that for all $\delta > 0$

\[ \rho_X\big( \{ x \in X : |\rho(1|x) - \tfrac{1}{2}| \le \delta \} \big) \le (C\delta)^s. \]

Interestingly, under the Tsybakov margin condition Assumption 3.2 holds with $\tau = \frac{s}{s+1}$ and with $c_\tau$
depending only on $C$.

The third condition is about the decay of a suitable notion of approximation error [31].

Assumption 3.3. For $\lambda > 0$, let $f_\lambda$ be a minimizer of

\[ f_\lambda := \operatorname*{arg\,min}_{f \in \mathcal{H}_K} \big\{ \mathcal{E}(f) + \lambda \|f\|_K^2 \big\}. \tag{3.3} \]

The approximation error associated with the triple $(\rho, V, K)$ is defined by

\[ \mathcal{D}(\lambda) = \mathcal{E}(f_\lambda) - \mathcal{E}(f_\rho^V) + \lambda \|f_\lambda\|_K^2. \tag{3.4} \]

We assume that for some $\beta \in (0,1]$ and $c_\beta > 0$, the approximation error satisfies

\[ \mathcal{D}(\lambda) \le c_\beta \lambda^{\beta}, \qquad \forall \lambda > 0. \tag{3.5} \]

The above assumption is standard when analyzing regularized empirical risk minimization and is related
to the definition of interpolation spaces by means of the so-called K-functional [17]. Interestingly,
we will see in the following that it is also important when analyzing the approximation properties of the
subgradient algorithm (2.1).

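Finally, the last condition characterizes the capacity of a ball in the RKHS $\mathcal{H}_K$ in terms of empirical
covering numbers, and plays an essential role in sample error estimates. Recall that for a subset $G$ of a
metric space $(H, d)$, the covering number $\mathcal{N}(G, \epsilon, d)$ is defined by

\[ \mathcal{N}(G, \epsilon, d) = \inf \Big\{ l \in \mathbb{N} : \exists f_1, f_2, \ldots, f_l \in H \text{ such that } G \subseteq \bigcup_{i=1}^{l} \{ f \in G : d(f, f_i) \le \epsilon \} \Big\}. \]

Assumption 3.4. Let $G$ be a set of functions on $X$. The metric $d_{2,\mathbf{z}}$ is defined on $G$ by

\[ d_{2,\mathbf{z}}(f, g) = \Big\{ \frac{1}{m} \sum_{i=1}^{m} (f(x_i) - g(x_i))^2 \Big\}^{1/2}, \qquad f, g \in G. \]

We assume that for some $\zeta \in (0,2)$, $c_\zeta > 0$, the covering numbers of the unit ball $B_1$ in $\mathcal{H}_K$ with respect
to $d_{2,\mathbf{z}}$ satisfy

\[ \mathbb{E}_{\mathbf{z}} \big[ \log \mathcal{N}(B_1, \epsilon, d_{2,\mathbf{z}}) \big] \le c_\zeta \Big( \frac{1}{\epsilon} \Big)^{\zeta}, \qquad \forall \epsilon > 0. \tag{3.6} \]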

The smaller $\zeta$ is, the more stringent the capacity assumption is. As $\zeta$ approaches 2 we are essentially
considering a capacity independent scenario. In what follows, we briefly comment on the connection
between the above assumption and other related assumptions. Recall that the capacity of the RKHS may
be measured by various concepts: covering numbers of balls $B_R$ in $\mathcal{H}_K$, (dyadic) entropy numbers and
decay of the eigenvalues of the integral operator $L_K : L^2_{\rho_X} \to L^2_{\rho_X}$ given by $L_K(f) = \int_X f(x) K_x \, d\rho_X(x)$, where
$L^2_{\rho_X} = \{ f : X \to \mathbb{R} : \int_X |f(x)|^2 d\rho_X < \infty \}$. For a subset $G$ of a metric space $(H, d)$, its $n$-th entropy number is
defined by

\[ e_n(G, d) = \inf \Big\{ \epsilon > 0 : \exists f_1, f_2, \ldots, f_{2^{n-1}} \text{ such that } G \subseteq \bigcup_{i=1}^{2^{n-1}} \{ f \in G : d(f, f_i) \le \epsilon \} \Big\}. \]

First, the covering and entropy numbers are equivalent (see e.g. [34, Lemma 6.21]). Indeed, for $\zeta > 0$,
the covering numbers $\mathcal{N}(G, \epsilon, d)$ satisfy

\[ \log \mathcal{N}(G, \epsilon, d) \le a_\zeta \Big( \frac{1}{\epsilon} \Big)^{\zeta}, \qquad \forall \epsilon > 0, \]

for some $a_\zeta > 0$, if and only if the entropy numbers $e_n(G, d)$ satisfy

\[ e_n(G, d) \le a'_\zeta \Big( \frac{1}{n} \Big)^{1/\zeta}, \]

for some $a'_\zeta > 0$. Second, it is shown in [33] that if the eigenvalues $\{\lambda_n\}_n$ of the integral operator $L_K$ satisfy

\[ \lambda_n \le b_\zeta\, n^{-2/\zeta}, \qquad n \ge 1, \]

for some constants $b_\zeta \ge 1$ and $\zeta \in (0,2)$, then the expectations of the random entropy numbers
$\mathbb{E}_{\mathbf{z}}[e_n(B_1, d_{2,\mathbf{z}})]$ satisfy

\[ \mathbb{E}_{\mathbf{z}}[e_n(B_1, d_{2,\mathbf{z}})] \le a_\zeta\, n^{-1/\zeta}, \qquad n \ge 1, \]

for some constant $a_\zeta$. Hence, using the equivalence of covering and entropy numbers, $\mathbb{E}_{\mathbf{z}}[\log \mathcal{N}(B_1, \epsilon, d_{2,\mathbf{z}})]$
can be estimated from the eigenvalue decay of the integral operator $L_K$. Last, since $d_{2,\mathbf{z}}(f, g) \le
\|f - g\|_\infty$, one has that for any $\epsilon > 0$, $\mathcal{N}(B_1, \epsilon, d_{2,\mathbf{z}})$ is bounded by $\mathcal{N}(B_1, \epsilon, \|\cdot\|_\infty)$, the uniform covering
number of $B_1$ under the metric $\|\cdot\|_\infty$. Thus, the covering numbers $\mathcal{N}(B_1, \epsilon, d_{2,\mathbf{z}})$ can be estimated given
the uniform smoothness of the kernel [46].




3.2 Finite Sample Bounds for General Convex Loss Functions 

The following is our main result, providing a general finite sample bound for the iterative regularization
induced by the subgradient method for convex loss functions, considering the last iterate.


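Theorem 3.5. Assume (3.1) with $q \ge 0$, (3.2) with $\tau \in [0,1]$, (3.5) with $\beta \in (0,1]$ and (3.6) with
$\zeta \in (0,2)$. Let $\eta_t = \eta_1 t^{-\theta}$ with $0 < \theta < 1$ satisfying $\theta \ge \frac{q}{q+1}$ and $\eta_1$ satisfying

\[ 0 < \eta_1 \le \min \Big\{ \frac{\sqrt{1-\theta}}{\sqrt{2}\, c_q (\kappa+1)^{q+1}},\ \frac{1-\theta}{4 |V|_0} \Big\}. \]

If $T$ is the integer part of $\lceil m^{\gamma} \rceil$, then for any $0 < \delta < 1$, with confidence $1 - \delta$, we have

\[ \mathcal{E}(f_T) - \mathcal{E}(f_\rho^V) \le \begin{cases} C m^{-\alpha} \log \frac{2}{\delta}, & \text{when } \theta > \frac{q+1}{q+2}, \\ C m^{-\alpha} \log m \, \log \frac{2}{\delta}, & \text{when } \theta \le \frac{q+1}{q+2}, \end{cases} \tag{3.7} \]

where the power indices $\gamma$ and $\alpha$ are defined as

\[ \gamma = \begin{cases} \dfrac{2}{(1-\theta) \big[ (1+2\beta)(2-\tau+\zeta\tau/2) + q(1+\zeta/2) \big]}, & \text{when } \theta \ge \frac{q+1}{q+2}, \\[2ex] \dfrac{2}{(1-\theta) \big[ \big(1 + 2\beta \frac{(q+1)\theta - q}{1-\theta}\big)(2-\tau+\zeta\tau/2) + q(1+\zeta/2) \big]}, & \text{when } \theta < \frac{q+1}{q+2}, \end{cases} \tag{3.8} \]

\[ \alpha = \begin{cases} \dfrac{\beta}{\beta(2-\tau+\zeta\tau/2) + \big\{ \frac{2-\tau+\zeta\tau/2}{2} + \frac{q(2+\zeta)}{4} \big\}}, & \text{when } \theta \ge \frac{q+1}{q+2}, \\[2ex] \dfrac{\beta}{\beta(2-\tau+\zeta\tau/2) + \frac{1-\theta}{(q+1)\theta - q} \big\{ \frac{2-\tau+\zeta\tau/2}{2} + \frac{q(2+\zeta)}{4} \big\}}, & \text{when } \theta < \frac{q+1}{q+2}, \end{cases} \tag{3.9} \]

and $C$ is a constant independent of $m$ or $\delta$ (given explicitly in the proof).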


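The proof is deferred to Section 5 and is based on a novel error decomposition, discussed in Section 3.6,
integrating statistical and optimization aspects. We illustrate the above result for Lipschitz loss functions,
that is, considering $q = 0$.

Corollary 3.6. Assume (3.1) with $q = 0$, (3.6) with $\zeta \in (0,2)$ and (3.5) with $\beta \in (0,1]$. Let $\eta_t = \eta_1 t^{-\theta}$
with $0 < \theta < 1$ and $\eta_1$ satisfying $0 < \eta_1 \le \min \big\{ \frac{\sqrt{1-\theta}}{\sqrt{2}\, c_q (\kappa+1)}, \frac{1-\theta}{4|V|_0} \big\}$. If $T$ is the integer part of $\lceil m^{\gamma} \rceil$, then
for any $0 < \delta < 1$, with confidence $1 - \delta$, we have

\[ \mathcal{E}(f_T) - \mathcal{E}(f_\rho^V) \le \begin{cases} C m^{-\alpha} \log \frac{2}{\delta}, & \text{when } \theta > \frac{1}{2}, \\ C m^{-\alpha} \log m \, \log \frac{2}{\delta}, & \text{when } \theta \le \frac{1}{2}, \end{cases} \]

where the power indices $\gamma$ and $\alpha$ are defined as

\[ \gamma = \begin{cases} \dfrac{2}{(1-\theta)(2\beta+1)(2-\tau+\zeta\tau/2)}, & \text{when } \theta \ge \frac{1}{2}, \\[2ex] \dfrac{2}{(1-\theta+2\beta\theta)(2-\tau+\zeta\tau/2)}, & \text{when } \theta < \frac{1}{2}, \end{cases} \qquad \alpha = \begin{cases} \dfrac{2\beta}{(2\beta+1)(2-\tau+\zeta\tau/2)}, & \text{when } \theta \ge \frac{1}{2}, \\[2ex] \dfrac{2\theta\beta}{(1-\theta+2\beta\theta)(2-\tau+\zeta\tau/2)}, & \text{when } \theta < \frac{1}{2}, \end{cases} \]

and $C$ is a constant independent of $m$ or $\delta$.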
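
To get a feeling for the exponents in the corollary, the following Python sketch evaluates $\gamma$ and $\alpha$ exactly with rational arithmetic. It is a small calculator under the assumption that the formulas are read as $\gamma = 2/((1-\theta)(2\beta+1)(2-\tau+\zeta\tau/2))$ and $\alpha = 2\beta/((2\beta+1)(2-\tau+\zeta\tau/2))$ for $\theta \ge 1/2$, with the corresponding expressions in $1-\theta+2\beta\theta$ for $\theta < 1/2$; the function name is ours.

```python
from fractions import Fraction

def lipschitz_exponents(theta, beta, tau, zeta):
    """Return (gamma, alpha) for Lipschitz losses (q = 0), as exact fractions."""
    theta, beta, tau, zeta = (Fraction(v) for v in (theta, beta, tau, zeta))
    a = 2 - tau + zeta * tau / 2               # the factor 2 - tau + zeta*tau/2
    if theta >= Fraction(1, 2):
        gamma = 2 / ((1 - theta) * (2 * beta + 1) * a)
        alpha = 2 * beta / ((2 * beta + 1) * a)
    else:
        denom = (1 - theta + 2 * beta * theta) * a
        gamma = 2 / denom
        alpha = 2 * theta * beta / denom
    return gamma, alpha

# theta = 1/2, beta = 1, tau = 0: stop after about m^(2/3) iterations, rate m^(-1/3).
fast_decay = lipschitz_exponents(Fraction(1, 2), 1, 0, 1)
# theta = 1/4, beta = 1, tau = 0: a slower step size decay gives a smaller alpha.
slow_decay = lipschitz_exponents(Fraction(1, 4), 1, 0, 1)
```

With $\tau = 0$ the capacity exponent $\zeta$ drops out of both formulas, which is why its value in the two calls above is immaterial.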


The above results give finite sample bounds on the excess risk, provided that a suitable stopping rule
is considered. While the stopping rule in the above theorems is distribution dependent, a data-driven stopping
rule can be given by hold-out cross validation and adaptively achieves the same bound. The proof of this
latter result is straightforward using the techniques in [14] and is omitted. The obtained bounds directly
yield strong consistency (almost sure convergence) using standard arguments. Interestingly, our analysis
suggests that a decaying stepsize needs to be chosen to achieve meaningful error bounds. The stepsize
choice can influence both the early stopping rule and the error bounds. More precisely, if the step size
decreases fast enough, $\theta \ge \frac{q+1}{q+2}$, the stopping rule depends on the decay speed but the error bound does
not. In this case the best possible choice for the early stopping rule is $\theta = \frac{q+1}{q+2}$, that is $\eta_t \sim 1/\sqrt{t}$ in the
case of Lipschitz loss functions. With this choice, if for example we take the limit $\beta \to 1$, $\tau \to 0$, we have
that the stopping rule scales as $O(m^{2/3})$ whereas the corresponding finite sample bound is $O(m^{-1/3})$.
A slower stepsize decay given by $\theta < \frac{q+1}{q+2}$ affects both the stopping rule and the error bounds, but the
results in this regime worsen. A more detailed discussion of the obtained bounds in comparison to other
learning algorithms is postponed to Section 3.5. Next we discuss the behavior of different variants of the
proposed algorithm.
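
The data-driven stopping rule mentioned above can be illustrated as follows. The Python sketch below is a minimal example under our own assumptions (linear kernel, hinge loss, synthetic Gaussian data, hypothetical function names): the iteration is run once on a training split, and the stopping time is the index of the iterate with the smallest hinge risk on a held-out split. It is not the procedure analyzed in [14], only an indication of how the regularization path can be reused for model selection.

```python
import random

def dot(w, x):
    return sum(wk * xk for wk, xk in zip(w, x))

def hinge(y, a):
    return max(1.0 - y * a, 0.0)

def hinge_left_derivative(y, a):
    if y > 0:
        return -1.0 if a <= 1.0 else 0.0
    return 1.0 if a > -1.0 else 0.0

def linear_subgradient_path(X, y, T, eta1=0.5, theta=0.5):
    """Iterates w_1, ..., w_{T+1} of the linear recursion with eta_t = eta1 * t^(-theta)."""
    m, d = len(X), len(X[0])
    w = [0.0] * d                          # w_1 = 0
    path = [list(w)]
    for t in range(1, T + 1):
        eta_t = eta1 * t ** (-theta)
        grad = [0.0] * d
        for xj, yj in zip(X, y):
            s = hinge_left_derivative(yj, dot(w, xj))
            for k in range(d):
                grad[k] += s * xj[k] / m
        w = [wk - eta_t * gk for wk, gk in zip(w, grad)]
        path.append(list(w))
    return path

def risk(w, X, y):
    return sum(hinge(yj, dot(w, xj)) for xj, yj in zip(X, y)) / len(y)

def holdout_stopping(path, X_val, y_val):
    """Return the (1-based) index of the iterate with smallest validation risk."""
    risks = [risk(w, X_val, y_val) for w in path]
    t_hat = min(range(len(risks)), key=risks.__getitem__)
    return t_hat + 1, risks

# Synthetic two-class data: the first coordinate carries the label plus noise.
rng = random.Random(0)

def sample(n):
    X, y = [], []
    for _ in range(n):
        label = rng.choice([-1, 1])
        X.append([label + rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y

X_train, y_train = sample(40)
X_val, y_val = sample(40)
path = linear_subgradient_path(X_train, y_train, T=100)
t_hat, val_risks = holdout_stopping(path, X_val, y_val)
```

Because all iterates are produced by a single run, selecting the stopping time costs only the evaluation of the validation risk along the path.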

As mentioned before in the subgradient method, when the goal is empirical risk minimization, the 
average or best iterates are often considered (see (2.2), (2.3)). It is natural to ask what are the properties 
of the estimator obtained with these latter choices, that is when they are used as approximate minimizers 
of the expected, rather than the empirical, risk. The following theorem provides an answer. 


Theorem 3.7. Under the assumptions of Theorem 3.5, if $T$ is the integer part of $\lceil m^{\gamma} \rceil$ and $g_T = a_T$ (or
$b_T$), then for any $0 < \delta < 1$, with confidence $1 - \delta$, we have

\[ \mathcal{E}(g_T) - \mathcal{E}(f_\rho^V) \le \begin{cases} C m^{-\alpha} \log \frac{2}{\delta}, & \text{when } \theta \ne \frac{q+1}{q+2}, \\ C m^{-\alpha} \log m \, \log \frac{2}{\delta}, & \text{when } \theta = \frac{q+1}{q+2}, \end{cases} \]

where the power indices $\gamma$ and $\alpha$ are defined as in Theorem 3.5 and $C$ is a constant independent of $m$ or
$\delta$ (can be given explicitly).

The above result shows that, perhaps surprisingly, the behavior of the average and best iterates is
essentially the same as the last iterate. Indeed, there is only a subtle difference between the upper bounds
in Theorem 3.7 and Theorem 3.5, since the latter has an extra $\log m$ factor when $\theta < \frac{q+1}{q+2}$. In the next
section we consider the case where the loss is not only convex but also smooth.








3.3 Finite Sample Bounds for Smooth Loss Functions 

In this section, we additionally assume that $V(y, \cdot)$ is differentiable and $V'(y, \cdot)$ is Lipschitz continuous
with constant $L > 0$, i.e., for any $y \in Y$ and $a, b \in \mathbb{R}$,

\[ |V'(y, b) - V'(y, a)| \le L |b - a|. \]

For the logistic loss in binary classification, see Example 2.1, it is easy to prove that both $V(y, \cdot)$ and
$V'(y, \cdot)$ are Lipschitz continuous with constant $L = 1$, for all $y \in Y$. With the above smoothness assumption,
we prove the following convergence result.

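Theorem 3.8. Assume (3.1) with $q \ge 0$, (3.2) with $\tau \in [0,1]$, (3.5) with $\beta \in (0,1]$ and (3.6) with
$\zeta \in (0,2)$. Assume that $V(y, \cdot)$ is differentiable and $V'(y, \cdot)$ is Lipschitz continuous with constant $L > 0$.
Let $\eta_t = \eta_1 t^{-\theta}$ with $0 \le \theta < 1$ and $0 < \eta_1 \le \min\{ [\ldots],\ (L\kappa^2)^{-1} \}$. If $T$ is the integer part of $\lceil m^{\gamma} \rceil$, then
for any $0 < \delta < 1$, with confidence $1 - \delta$, we have

\[ \mathcal{E}(f_T) - \mathcal{E}(f_\rho^V) \le C m^{-\alpha} \log \frac{2}{\delta}, \]

where the power indices $\gamma$ and $\alpha$ are defined as

\[ \gamma = \frac{2}{1-\theta} \cdot \frac{1}{(1+2\beta)(2-\tau+\zeta\tau/2) + q(1+\zeta/2)}, \qquad \alpha = \frac{\beta}{\beta(2-\tau+\zeta\tau/2) + \big\{ \frac{2-\tau+\zeta\tau/2}{2} + \frac{q(2+\zeta)}{4} \big\}}, \]

and $C$ is a constant independent of $m$ or $\delta$.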

The proof of this result will be given in Section 5. We can simplify the result by considering Lipschitz
loss functions ($q = 0$) and setting $\tau = 0$.

Corollary 3.9. Under the assumptions of Theorem 3.8, let $q = 0$. If $T$ is the integer part of $\lceil m^{\gamma} \rceil$, then
for any $0 < \delta < 1$, with confidence $1 - \delta$, we have

\[ \mathcal{E}(f_T) - \mathcal{E}(f_\rho^V) \le C m^{-\alpha} \log \frac{2}{\delta}, \]

where the power indices $\gamma$ and $\alpha$ are defined as

\[ \gamma = \frac{2}{(1-\theta)(2\beta+1)(2-\tau+\zeta\tau/2)}, \qquad \alpha = \frac{2\beta}{(2\beta+1)(2-\tau+\zeta\tau/2)}, \]

and $C$ is a constant independent of $m$ or $\delta$.

The finite sample bound obtained above is essentially the same as the best possible bound obtained
for general convex loss. However, the important difference is that for smooth loss functions, a constant
stepsize can be chosen, which considerably improves the stopping rule. Indeed, if for example we
consider the limit $\beta \to 1$, $\tau \to 0$, we have that the stopping rule is $O(m^{1/3})$ rather than $O(m^{2/3})$, whereas
the corresponding finite sample bound is again $O(m^{-1/3})$.

3.4 Iterative Regularization for Classification: Surrogate Loss Functions and
Hinge Loss

We briefly discuss how the above results allow us to derive error bounds in binary classification. In this
latter case $Y = \{1, -1\}$ and a natural choice for the loss function is the misclassification loss given by

\[ V(y, b(x)) = \Theta(-y b(x)) \tag{3.10} \]

for $b : X \to Y$ and $\Theta(a) = 1$ if $a > 0$, and $\Theta(a) = 0$ otherwise. The corresponding generalization error,
usually denoted by $\mathcal{R}$, is called misclassification risk, since it can be shown to be the probability of the
event $\{(x, y) \in Z : y \ne b(x)\}$. The minimizer of the misclassification error is the Bayes rule $b_\rho : X \to Y$
given by

\[ b_\rho(x) = \begin{cases} 1, & \text{if the conditional probability } \rho(y = 1 | x) \ge 1/2, \\ -1, & \text{otherwise.} \end{cases} \]

The misclassification loss (3.10) is neither convex nor smooth and thus leads to intractable problems.
Moreover, the search for a solution among binary valued functions is also infeasible. In practice, a convex
(so-called surrogate) loss function is typically considered and a classifier is obtained by estimating a real
function $f$ and then taking its sign, defined as

\[ \mathrm{sign}(f)(x) = \begin{cases} 1, & \text{if } f(x) \ge 0, \\ -1, & \text{otherwise.} \end{cases} \]

The question arises of if, and how, error bounds on the excess risk $\mathcal{E}(f) - \mathcal{E}(f_\rho^V)$ yield results on
$\mathcal{R}(\mathrm{sign} f) - \mathcal{R}(b_\rho)$. Indeed, so-called comparison results are known relating these different error measures,
see e.g. [17, 34] and references therein. We discuss in particular the case of the hinge loss function, see
Example 2.1, since in this case for all measurable functions $f$ it holds that

\[ \mathcal{R}(\mathrm{sign} f) - \mathcal{R}(b_\rho) \le \mathcal{E}(f) - \mathcal{E}(f_\rho^V). \]

Indeed, the hinge loss satisfies Assumption (3.1) with $q = 0$ and, under the Tsybakov noise condition,
Assumption (3.2). Misclassification error bounds for the iterative regularization induced by the hinge
loss can then be obtained as a corollary of Theorem 3.5, using the above facts. Below we provide a
simplified result.

Theorem 3.10. Let Y = {1,-1} and V be the hinge loss. Let 0 < e < ^ and (3.5) is satisfied with 
(3 G (0,1]. Let rjt = 'nit~^ with 9 > 1(2 and 0 < rji < min 
and T is the integer part of [m ], then with confidence 1 — <5, we have 

'R- {signifr)) - TZUc) < Cm~^ log p (3.11) 

d 

In particular, if (3 > with e G (0,1/3), then with confidence 1 — S, 

n (signifT)) - 7^(/c) < Cm^~^ log 

The proof of the above result is given in Section 5, whereas we comment on the obtained rates in
the next section. We add one observation first. We note that, as illustrated by the next result, a
different regularization strategy than early stopping can be considered, where the stopping rule is kept
fixed while the step size is chosen in a distribution dependent way.


V (1-^) 1- 0 

\/2(k+1)’ 4 


> . If (3.6) is valid with ( G (0,2) 
Theorem 3.11. Let Y = {1,-1} and V be the hinge loss given. Let 0 < e < ^ and (3.5) is satisfied 
with 1 > /3 > 1^. Let rjt = with 9 = 0 < ryi < min . If (3.6) 

is valid with ( G (0, 2) and T is the integer part of then with confidence 1 — <5, we have 


n {sign{fT)) - 7^(/c) < Cm^ ^ log 


3.5 Comparison with Other Learning Algorithms 

As mentioned before, iterative regularization has clear advantages from a computational point of view.
The algorithm reduces to a simple first order method with typically low iteration cost and allows one to
seamlessly compute the estimators corresponding to different regularization levels (the regularization path), a
crucial fact since model selection needs to be performed. It is natural to compare the obtained statistical
bounds with those for other learning algorithms. For general convex loss functions, the methods for which
sharp bounds are available are penalized empirical risk minimization (Tikhonov regularization), i.e.

\[ f_{\mathbf{z},\lambda} = \operatorname*{arg\,min}_{f \in \mathcal{H}_K} \big\{ \mathcal{E}_{\mathbf{z}}(f) + \lambda \|f\|_K^2 \big\}, \qquad \lambda > 0, \]

see e.g. [17, 34] and references therein. The best error bounds for Tikhonov regularization with Lipschitz
loss functions, see e.g. [34, Chapter 7], are of order $O(m^{-\alpha})$ with

\[ \alpha = \min \Big\{ \frac{2\beta}{\beta+1},\ \frac{\beta}{(2 - \zeta/2 - \tau + \tau\zeta/2)\beta + \zeta/2} \Big\}, \]

which reduces to

\[ \frac{\beta}{\beta+1} \]

if no variance assumption is made ($\tau = 0$) and in the capacity independent limit ($\zeta \to 2$). From
Theorem 3.5 for Lipschitz loss functions, on the other hand, we see that the bounds we obtain are of order $O(m^{-\alpha})$ with
exponent

\[ \frac{2\beta}{(2\beta+1)(2-\tau+\zeta\tau/2)}, \]

reducing to

\[ \frac{\beta}{2\beta+1} \]

in the no variance and capacity independent limit. The obtained bounds are worse than the best ones
available for Tikhonov regularization. However, the analysis of the latter does not take into account the
optimization error and it is still an open question whether the best rate is preserved when such an error
is incorporated. At this point we are prone to believe this gap to be a byproduct of our analysis rather
than a fundamental fact, and addressing this point should be a subject of further work. Moreover, we
note that our analysis allows one to derive error bounds for all Nemitski loss functions.

Beyond Tikhonov regularization, we can compare with the online regularization scheme for the hinge
loss. The online learning algorithms with regularization sequence $\{\lambda_t > 0\}_t$ defined by

\[ f_{t+1} = \begin{cases} (1 - \eta_t \lambda_t) f_t, & \text{if } y_t f_t(x_t) > 1, \\ (1 - \eta_t \lambda_t) f_t + \eta_t y_t K_{x_t}, & \text{if } y_t f_t(x_t) \le 1, \end{cases} \tag{3.12} \]

were studied in [43, 42]. Our results improve the results in [43, 42] in two aspects. The bound obtained
in [43] is of the form 0(T’^~i) while the bound in Theorem 3.11 is of type by substituting the 
expression for $T$. Moreover, our results hold with high probability and promptly yield almost sure
convergence, whereas the results in [43] are only in expectation. We note that, interestingly, sharp bounds
for Lipschitz loss functions are derived in [25], although the obtained results do not take into account
capacity and variance assumptions that could lead to large improvements.
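
For comparison with the batch iteration (2.1), one step of the online scheme (3.12) can be sketched as follows. This is only an illustration under our own assumptions: the function $f_t$ is stored as a kernel expansion over the examples that triggered an update, the kernel is Gaussian, and all names are hypothetical. Unlike (2.1), each step uses a single example and shrinks the current function by the factor $1 - \eta_t \lambda_t$.

```python
import math

def gaussian_kernel(x, u, width=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(x, u))
    return math.exp(-sq / (2.0 * width ** 2))

def evaluate(coeffs, centers, x, kernel):
    """Value at x of f = sum_j coeffs[j] * K(centers[j], .)."""
    return sum(cj * kernel(xj, x) for cj, xj in zip(coeffs, centers))

def online_hinge_step(coeffs, centers, x_t, y_t, eta_t, lam_t, kernel=gaussian_kernel):
    """One step of the regularized online update for the hinge loss."""
    margin = y_t * evaluate(coeffs, centers, x_t, kernel)   # y_t * f_t(x_t)
    shrink = 1.0 - eta_t * lam_t
    new_coeffs = [shrink * cj for cj in coeffs]             # (1 - eta_t * lam_t) f_t
    new_centers = list(centers)
    if margin <= 1.0:                                       # add eta_t * y_t * K_{x_t}
        new_coeffs.append(eta_t * y_t)
        new_centers.append(x_t)
    return new_coeffs, new_centers

# Two steps on the same example, starting from f_1 = 0 (empty expansion).
c1, z1 = online_hinge_step([], [], [1.0], 1, eta_t=0.5, lam_t=0.1)
c2, z2 = online_hinge_step(c1, z1, [1.0], 1, eta_t=0.5, lam_t=0.1)
# A step with margin larger than one only shrinks the current function.
c3, z3 = online_hinge_step([2.0], [[1.0]], [1.0], 1, eta_t=0.5, lam_t=0.1)
```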

We next compare with the previous results on iterative regularization. The only results available thus
far have been obtained for the square loss, for which bounds were first derived for gradient descent
in [11], but only for a fixed design regression setting, and in [41] for a general statistical learning setting.
While the bounds in [41] are suboptimal, they have later been improved in [4, 14, 27]. Interestingly, sharp
error bounds have also been proved for iterative regularization induced by other, potentially faster,
iterative techniques, including incremental gradient [28], conjugate gradient [7] and the so-called $\nu$-method
[4, 14], an accelerated gradient descent technique related to the Chebyshev method [18]. The best obtained
bounds are of order $O\big(m^{-\frac{2\beta}{2\beta+\zeta}}\big)$ and can be shown to be optimal since they match a corresponding
minimax lower bound [13]. Holding not only for the square loss, but for general Nemitski loss functions, the
bound obtained in Theorem 3.8 is of order $O\big(m^{-\frac{2\beta}{(2+\zeta)(\beta+1)}}\big)$, which is worse. In the capacity independent
limit, the best available bound we obtain is of order $O\big(m^{-\frac{\beta}{2(\beta+1)}}\big)$ whereas the optimal bound is of order
$O\big(m^{-\frac{\beta}{\beta+1}}\big)$. Also in this case, the reason for the gap appears to be technical and should be
further studied.

Finally, before giving the proofs of our results in detail, in the next section we discuss the general
error decomposition underlying our approach, which highlights the interplay between statistics and
optimization and could also be useful in other contexts.


3.6 Error Decomposition 

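Theorems 3.5 and 3.8 rely on a key error decomposition, that we derive next. The goal is to estimate the
excess risk $\mathcal{E}(f_T) - \mathcal{E}(f_\rho^V)$, and the starting point is to split the error by introducing a reference function
$f_* \in \mathcal{H}_K$,

\[
\mathcal{E}(f_T) - \mathcal{E}(f_\rho^V) = \mathcal{E}(f_T) - \mathcal{E}(f_*) + \mathcal{E}(f_*) - \mathcal{E}(f_\rho^V). \tag{3.13}
\]

The above equation can be further developed by considering

\[
\mathcal{E}(f_T) - \mathcal{E}(f_\rho^V) = \big(\mathcal{E}_{\mathbf z}(f_T) - \mathcal{E}_{\mathbf z}(f_*)\big) + \big(\mathcal{E}(f_T) - \mathcal{E}_{\mathbf z}(f_T) + \mathcal{E}_{\mathbf z}(f_*) - \mathcal{E}(f_*)\big) + \big(\mathcal{E}(f_*) - \mathcal{E}(f_\rho^V)\big). \tag{3.14}
\]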


Inspection of the above expression provides several insights. The first term is a computational error 
related to optimization. It quantifies the discrepancy between the empirical errors of the iterate defined 
by the subgradient method and that of the reference function. The last two terms are related to statistics. 
The second term is a sample error and can be studied using empirical process theory, provided that a 
bound on the norm of the iterates (and of the reference function) is available. Indeed, to get a sharper 
concentration the recentered quantity 


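\[
\big\{ \big(\mathcal{E}(f_T) - \mathcal{E}(f_\rho^V)\big) - \big(\mathcal{E}_{\mathbf z}(f_T) - \mathcal{E}_{\mathbf z}(f_\rho^V)\big) \big\} + \big\{ \big(\mathcal{E}_{\mathbf z}(f_*) - \mathcal{E}_{\mathbf z}(f_\rho^V)\big) - \big(\mathcal{E}(f_*) - \mathcal{E}(f_\rho^V)\big) \big\}
\]

can be considered [17, 34]. Note that the second addend can be negative so that we effectively only need
to control

\[
\mathcal{F}_{\mathbf z}(f_*) = \max\big\{ \big(\mathcal{E}_{\mathbf z}(f_*) - \mathcal{E}_{\mathbf z}(f_\rho^V)\big) - \big(\mathcal{E}(f_*) - \mathcal{E}(f_\rho^V)\big),\ 0 \big\}. \tag{3.15}
\]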
Finally the last term suggests that a natural choice for the reference function is an almost minimizer of
the expected risk, having bounded norm, and for which the approximation level can be quantified. While
there is a certain degree of freedom in the latter choice, in the following we will consider $f_* = f_\lambda$, the
minimizer of (3.3). With this latter choice we can control

\[
\mathcal{A}(f_*) = \mathcal{E}(f_*) - \mathcal{E}(f_\rho^V)
\]

by $\mathcal{D}(\lambda)$ given in Assumption 3.3. Indeed, other choices are possible, for example

\[
f_R = \arg\min_{f \in B_R} \mathcal{E}(f).
\]

With this choice, $\mathcal{A}(f_R)$ can be seen to be another standard way to measure approximation properties
[17, 34].

Collecting some of the above observations, we have the following lemma.

Lemma 3.12. For $R > 0$, we have

\[
\mathcal{E}(f_T) - \mathcal{E}(f_\rho^V) \le \big\{ \big(\mathcal{E}(f_T) - \mathcal{E}(f_\rho^V)\big) - \big(\mathcal{E}_{\mathbf z}(f_T) - \mathcal{E}_{\mathbf z}(f_\rho^V)\big) + \mathcal{F}_{\mathbf z}(f_*) \big\} + \big(\mathcal{E}_{\mathbf z}(f_T) - \mathcal{E}_{\mathbf z}(f_*)\big) + \mathcal{A}(f_*). \tag{3.16}
\]

In the next sections we estimate the various terms in this error decomposition. We will first deal
with the computational error, the analysis of which is the main technical contribution of the paper, and
then proceed to consider the sample and approximation error terms. The best stopping criterion and
corresponding rates are derived by suitably balancing the different error terms.
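For convenience, the step from (3.14) to Lemma 3.12 can be spelled out as follows. This is only a sketch of the algebra, in which $\mathcal{F}_{\mathbf z}(f_*)$ denotes the quantity defined in (3.15) and $\mathcal{A}(f_*) = \mathcal{E}(f_*) - \mathcal{E}(f_\rho^V)$; the middle term of (3.14) is recentered at $f_\rho^V$ and its second addend is bounded by its positive part.

```latex
\begin{align*}
\mathcal{E}(f_T) - \mathcal{E}_{\mathbf z}(f_T) + \mathcal{E}_{\mathbf z}(f_*) - \mathcal{E}(f_*)
 &= \big\{ \big(\mathcal{E}(f_T) - \mathcal{E}(f_\rho^V)\big)
         - \big(\mathcal{E}_{\mathbf z}(f_T) - \mathcal{E}_{\mathbf z}(f_\rho^V)\big) \big\} \\
 &\quad + \big\{ \big(\mathcal{E}_{\mathbf z}(f_*) - \mathcal{E}_{\mathbf z}(f_\rho^V)\big)
         - \big(\mathcal{E}(f_*) - \mathcal{E}(f_\rho^V)\big) \big\} \\
 &\le \big\{ \big(\mathcal{E}(f_T) - \mathcal{E}(f_\rho^V)\big)
         - \big(\mathcal{E}_{\mathbf z}(f_T) - \mathcal{E}_{\mathbf z}(f_\rho^V)\big) \big\}
      + \mathcal{F}_{\mathbf z}(f_*).
\end{align*}
```

Adding the computational term $\mathcal{E}_{\mathbf z}(f_T) - \mathcal{E}_{\mathbf z}(f_*)$ and the approximation term $\mathcal{A}(f_*)$ from (3.14) to both sides then gives (3.16).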


4 Computational Error 

In this section, we will bound the iterates and estimate the computational error, see Lemma 3.12. 


4.1 Bounds on Iterates 

We introduce the following key lemma, which will be used several times in our analysis. 
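Lemma 4.1. For any fixed $f \in \mathcal{H}_K$ and $t = 1, \dots, T$,

\[
\|f_{t+1} - f\|_K^2 \le \|f_t - f\|_K^2 + \eta_t^2 G_t^2 + 2\eta_t \big[\mathcal{E}_{\mathbf z}(f) - \mathcal{E}_{\mathbf z}(f_t)\big], \tag{4.1}
\]

where

\[
G_t^2 = \Big\| \frac{1}{m} \sum_{j=1}^m V'_-(y_j, f_t(x_j)) K_{x_j} \Big\|_K^2 \le c_q^2 (\kappa + 1)^{2q+2} \max\big\{1, \|f_t\|_K^{2q}\big\}. \tag{4.2}
\]

Proof. Computing the inner product $\langle f_{t+1} - f, f_{t+1} - f \rangle_K$ with $f_{t+1}$ given by (2.1) yields

\[
\|f_{t+1} - f\|_K^2 = \|f_t - f\|_K^2 + \eta_t^2 G_t^2 + \frac{2\eta_t}{m} \sum_{j=1}^m V'_-(y_j, f_t(x_j)) \langle K_{x_j}, f - f_t \rangle_K.
\]

Using the reproducing property

\[
f(x) = \langle f, K_x \rangle_K, \qquad \forall f \in \mathcal{H}_K,\ x \in X, \tag{4.3}
\]

we get

\[
\|f\|_\infty \le \kappa \|f\|_K, \qquad \forall f \in \mathcal{H}_K, \tag{4.4}
\]

and

\[
\|f_{t+1} - f\|_K^2 = \|f_t - f\|_K^2 + \eta_t^2 G_t^2 + \frac{2\eta_t}{m} \sum_{j=1}^m V'_-(y_j, f_t(x_j)) \big(f(x_j) - f_t(x_j)\big). \tag{4.5}
\]

Since $V(y_j, \cdot)$ is a convex function, we have

\[
V'_-(y_j, a)(b - a) \le V(y_j, b) - V(y_j, a), \qquad \forall a, b \in \mathbb{R}.
\]

Applying this inequality to (4.5) gives

\[
\|f_{t+1} - f\|_K^2 \le \|f_t - f\|_K^2 + \eta_t^2 G_t^2 + \frac{2\eta_t}{m} \sum_{j=1}^m \big[ V(y_j, f(x_j)) - V(y_j, f_t(x_j)) \big],
\]

where the last term is exactly $2\eta_t[\mathcal{E}_{\mathbf z}(f) - \mathcal{E}_{\mathbf z}(f_t)]$.

By (3.1), (4.4), and the observation $\|K_{x_j}\|_K = \sqrt{K(x_j, x_j)} \le \kappa$, we find

\[
G_t = \Big\| \frac{1}{m} \sum_{j=1}^m V'_-(y_j, f_t(x_j)) K_{x_j} \Big\|_K \le \frac{\kappa}{m} \sum_{j=1}^m c_q \big(1 + |f_t(x_j)|^q\big) \le \kappa c_q \big(1 + \kappa^q \|f_t\|_K^q\big),
\]

and the desired bound follows. □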

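Using the above lemma, we can bound the iterates as follows.

Lemma 4.2. Let $0 \le \theta < 1$ satisfy $\theta \ge \frac{q}{q+1}$ and $\eta_t = \eta_1 t^{-\theta}$ with $\eta_1$ satisfying (3.7). Then for
$t = 1, \dots, T$,

\[
\|f_{t+1}\|_K \le t^{\frac{1-\theta}{2}}. \tag{4.6}
\]

Proof. We prove our statement by induction. Taking $f = 0$ in Lemma 4.1, we know that

\[
\|f_{t+1}\|_K^2 \le \|f_t\|_K^2 + \eta_t^2 G_t^2 + 2\eta_t \big[\mathcal{E}_{\mathbf z}(0) - \mathcal{E}_{\mathbf z}(f_t)\big] \le \|f_t\|_K^2 + \eta_t^2 G_t^2 + 2\eta_t |V|_0.
\]

This verifies (4.6) for the case $t = 1$ since $f_1 = 0$ and $\eta_1^2 c_q^2 (\kappa + 1)^{2q+2} + 2\eta_1 |V|_0 \le 1$.

Assume $\|f_t\|_K \le (t-1)^{\frac{1-\theta}{2}}$ with $t \ge 2$. Then

\[
G_t^2 \le c_q^2 (\kappa + 1)^{2q+2} (t-1)^{(1-\theta)q}.
\]

Hence

\begin{align*}
\|f_{t+1}\|_K^2 &\le (t-1)^{1-\theta} + \eta_1^2 t^{-2\theta} c_q^2 (\kappa+1)^{2q+2} t^{(1-\theta)q} + 2\eta_1 t^{-\theta} |V|_0 \\
&\le t^{1-\theta} \Big\{ \Big(1 - \frac{1}{t}\Big)^{1-\theta} + \frac{\eta_1^2 c_q^2 (\kappa+1)^{2q+2}}{t^{(q+1)\theta + 1 - q}} + \frac{2\eta_1 |V|_0}{t} \Big\}.
\end{align*}

Since $\big(1 - \frac{1}{t}\big)^{1-\theta} \le 1 - \frac{1-\theta}{t}$ and the condition $\theta \ge \frac{q}{q+1}$ implies $(q+1)\theta + 1 - q \ge 1$, we have

\[
\|f_{t+1}\|_K^2 \le t^{1-\theta} \Big\{ 1 - \frac{1-\theta}{t} + \frac{\eta_1^2 c_q^2 (\kappa+1)^{2q+2}}{t} + \frac{2\eta_1 |V|_0}{t} \Big\}.
\]

Finally we use the restriction (3.7) for $\eta_1$ and find $\|f_{t+1}\|_K^2 \le t^{1-\theta}$. This completes the induction
procedure and proves our conclusion. □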
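The iterate bound (4.6) can be checked numerically on a toy problem. The sketch below assumes the hinge loss (so $q = 0$, $c_q = 1$, $|V|_0 = 1$), a Gaussian kernel (so $\kappa = 1$), and an illustrative small value of $\eta_1$; condition (3.7) is not restated in this section, so `eta1 = 0.1` is an assumption rather than the paper's constant. All function names are hypothetical.

```python
import math
import random

def gaussian_kernel(x, xp, sigma=1.0):
    """Gaussian kernel on the real line; K(x, x) = 1, so kappa = 1."""
    return math.exp(-((x - xp) ** 2) / (2.0 * sigma ** 2))

def hinge_subgradient(y, a):
    """A subgradient in `a` of the hinge loss V(y, a) = max(0, 1 - y * a)."""
    return -y if y * a < 1.0 else 0.0

def run_subgradient(xs, ys, T, eta1, theta):
    """Iterate f_{t+1} = f_t - eta_t * (1/m) * sum_j V'(y_j, f_t(x_j)) K_{x_j}, f_1 = 0.

    The iterate is stored as f_t = sum_j alpha[j] K_{x_j}.
    Returns the squared RKHS norms ||f_{t+1}||_K^2 for t = 1, ..., T.
    """
    m = len(xs)
    gram = [[gaussian_kernel(a, b) for b in xs] for a in xs]
    alpha = [0.0] * m
    sq_norms = []
    for t in range(1, T + 1):
        eta_t = eta1 * t ** (-theta)
        values = [sum(gram[i][j] * alpha[j] for j in range(m)) for i in range(m)]
        grads = [hinge_subgradient(ys[j], values[j]) for j in range(m)]
        alpha = [alpha[j] - eta_t * grads[j] / m for j in range(m)]
        k_alpha = [sum(gram[i][j] * alpha[j] for j in range(m)) for i in range(m)]
        sq_norms.append(sum(alpha[i] * k_alpha[i] for i in range(m)))
    return sq_norms

random.seed(0)
xs = [random.uniform(-2.0, 2.0) for _ in range(30)]
ys = [1.0 if x + random.gauss(0.0, 0.5) > 0 else -1.0 for x in xs]
theta, eta1, T = 0.5, 0.1, 200
sq_norms = run_subgradient(xs, ys, T, eta1, theta)
# (4.6) squared reads ||f_{t+1}||_K^2 <= t^(1 - theta).
violations = [t for t, n in enumerate(sq_norms, start=1) if n > t ** (1 - theta) + 1e-12]
print("violations of (4.6):", violations)
```

In this setting the bound follows from Lemma 4.1 with $f = 0$ alone, since each step adds at most $\eta_t^2 + 2\eta_t$ to the squared norm.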
By taking $f = f_t$ in (4.1), we see the following estimate for the difference $f_{t+1} - f_t$ from Lemmas 4.1
and 4.2.

Corollary 4.3. Under the assumptions of Lemma 4.2, we have for $t = 1, \dots, T$,

\[
\|f_{t+1} - f_t\|_K \le \eta_1 c_q (\kappa + 1)^{q+1} t^{\frac{q(1-\theta)}{2} - \theta}. \tag{4.7}
\]

Observe from the restriction $\theta \ge \frac{q}{q+1}$ in Lemma 4.2 that the power index in (4.7) satisfies
$\frac{q(1-\theta)}{2} - \theta \le -\frac{q}{2(q+1)} \le 0$.

4.2 Computational Error for the Last Iterate 

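In this subsection, we will estimate the computational error $\mathcal{E}_{\mathbf z}(f_T) - \mathcal{E}_{\mathbf z}(f_*)$ for some $f_* \in \mathcal{H}_K$. Some
ideas for estimating the average error in our proof are from [10, 30].

Lemma 4.4. Assume (3.1) with $q \ge 0$. Let $f_* \in \mathcal{H}_K$. If $\eta_t = \eta_1 t^{-\theta}$ with $0 \le \theta < 1$ satisfying $\theta \ge \frac{q}{q+1}$
and $\eta_1$ satisfying (3.7), then we have

\[
\mathcal{E}_{\mathbf z}(f_T) - \mathcal{E}_{\mathbf z}(f_*) \le \Big( \frac{\|f_*\|_K^2}{2\eta_1} + c'_\theta \mathcal{E}_{\mathbf z}(f_*) + C_1 \Big) \Lambda_T + \frac{1}{2\eta_T} \sum_{k=1}^{T-1} \frac{1}{k+1} \Big[ \frac{1}{k} \sum_{t=T-k+1}^{T} 2\eta_t - 2\eta_{T-k} \Big] \big\{ \mathcal{E}_{\mathbf z}(f_{T-k}) - \mathcal{E}_{\mathbf z}(f_*) \big\}, \tag{4.8}
\]

where $\Lambda_T$ is defined by

\[
\Lambda_T =
\begin{cases}
T^{-(1-\theta)}, & \text{when } \theta > \frac{q+1}{q+2}, \\
(\log T)\, T^{-(1-\theta)}, & \text{when } \theta = \frac{q+1}{q+2}, \\
(\log T)\, T^{-(\theta(1+q)-q)}, & \text{when } \theta < \frac{q+1}{q+2},
\end{cases}
\tag{4.9}
\]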

c'g := jfw (l + max{(2 + log4)(l + \ogf)t~^ : t G N}) and Ci is a constant depending on q,K,9 (indepen¬ 
dent ofT,m or /, and given explicitly in the proof). 


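Proof. Lemma 4.1 is key in our proof. In particular, we shall apply the following equivalent form of
inequality (4.1) from Lemma 4.1 several times with various choices of $f \in \mathcal{H}_K$:

\[
2\eta_t \big[\mathcal{E}_{\mathbf z}(f_t) - \mathcal{E}_{\mathbf z}(f)\big] \le \big\{ \|f_t - f\|_K^2 - \|f_{t+1} - f\|_K^2 \big\} + \eta_t^2 G_t^2. \tag{4.10}
\]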


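Step 1: Error decomposition. Decompose the weighted empirical error $2\eta_T \mathcal{E}_{\mathbf z}(f_T)$ as

\begin{align*}
2\eta_T \mathcal{E}_{\mathbf z}(f_T) &= \frac{1}{2} \big\{ 2\eta_T \mathcal{E}_{\mathbf z}(f_T) + 2\eta_{T-1} \mathcal{E}_{\mathbf z}(f_{T-1}) \big\} \\
&\quad + \frac{1}{2}\, 2\eta_T \big\{ \mathcal{E}_{\mathbf z}(f_T) - \mathcal{E}_{\mathbf z}(f_{T-1}) \big\} + \frac{1}{2} \big\{ 2\eta_T - 2\eta_{T-1} \big\} \mathcal{E}_{\mathbf z}(f_{T-1}) \\
&= \frac{1}{3} \big\{ 2\eta_T \mathcal{E}_{\mathbf z}(f_T) + 2\eta_{T-1} \mathcal{E}_{\mathbf z}(f_{T-1}) + 2\eta_{T-2} \mathcal{E}_{\mathbf z}(f_{T-2}) \big\} \\
&\quad + \frac{1}{2 \times 3} \big\{ 2\eta_T \big[ \mathcal{E}_{\mathbf z}(f_T) - \mathcal{E}_{\mathbf z}(f_{T-2}) \big] + 2\eta_{T-1} \big[ \mathcal{E}_{\mathbf z}(f_{T-1}) - \mathcal{E}_{\mathbf z}(f_{T-2}) \big] \big\} \\
&\quad + \frac{1}{2}\, 2\eta_T \big\{ \mathcal{E}_{\mathbf z}(f_T) - \mathcal{E}_{\mathbf z}(f_{T-1}) \big\} + \frac{1}{2} \big\{ 2\eta_T - 2\eta_{T-1} \big\} \mathcal{E}_{\mathbf z}(f_{T-1}) \\
&\quad + \frac{1}{2 \times 3} \big\{ [2\eta_T - 2\eta_{T-2}] + [2\eta_{T-1} - 2\eta_{T-2}] \big\} \mathcal{E}_{\mathbf z}(f_{T-2}).
\end{align*}

Repeating the above process by means of the decomposition

\begin{align*}
\frac{1}{k} \sum_{j=0}^{k-1} 2\eta_{T-j} \mathcal{E}_{\mathbf z}(f_{T-j}) &= \frac{1}{k+1} \sum_{j=0}^{k} 2\eta_{T-j} \mathcal{E}_{\mathbf z}(f_{T-j}) \\
&\quad + \frac{1}{k(k+1)} \sum_{j=0}^{k-1} 2\eta_{T-j} \big\{ \mathcal{E}_{\mathbf z}(f_{T-j}) - \mathcal{E}_{\mathbf z}(f_{T-k}) \big\} + \frac{1}{k(k+1)} \sum_{j=0}^{k-1} \big\{ 2\eta_{T-j} - 2\eta_{T-k} \big\} \mathcal{E}_{\mathbf z}(f_{T-k})
\end{align*}

with $k = 3, \dots, T-1$, we know that

\begin{align*}
2\eta_T \mathcal{E}_{\mathbf z}(f_T) &= \frac{1}{T} \sum_{j=0}^{T-1} 2\eta_{T-j} \mathcal{E}_{\mathbf z}(f_{T-j}) + \sum_{k=1}^{T-1} \frac{1}{k(k+1)} \sum_{j=0}^{k-1} 2\eta_{T-j} \big\{ \mathcal{E}_{\mathbf z}(f_{T-j}) - \mathcal{E}_{\mathbf z}(f_{T-k}) \big\} \\
&\quad + \sum_{k=1}^{T-1} \frac{1}{k(k+1)} \sum_{j=0}^{k-1} \big\{ 2\eta_{T-j} - 2\eta_{T-k} \big\} \mathcal{E}_{\mathbf z}(f_{T-k}).
\end{align*}

Hence the following error decomposition holds true:

\begin{align*}
2\eta_T \big\{ \mathcal{E}_{\mathbf z}(f_T) - \mathcal{E}_{\mathbf z}(f_*) \big\} &= \frac{1}{T} \sum_{t=1}^{T} 2\eta_t \big\{ \mathcal{E}_{\mathbf z}(f_t) - \mathcal{E}_{\mathbf z}(f_*) \big\} + \sum_{k=1}^{T-1} \frac{1}{k(k+1)} \sum_{t=T-k+1}^{T} 2\eta_t \big\{ \mathcal{E}_{\mathbf z}(f_t) - \mathcal{E}_{\mathbf z}(f_{T-k}) \big\} \\
&\quad + \Big[ \frac{1}{T} \sum_{t=1}^{T} 2\eta_t - 2\eta_T + \sum_{k=1}^{T-1} \frac{1}{k+1} \Big( \frac{1}{k} \sum_{t=T-k+1}^{T} 2\eta_t - 2\eta_{T-k} \Big) \Big] \mathcal{E}_{\mathbf z}(f_*) \\
&\quad + \sum_{k=1}^{T-1} \frac{1}{k+1} \Big[ \frac{1}{k} \sum_{t=T-k+1}^{T} 2\eta_t - 2\eta_{T-k} \Big] \big\{ \mathcal{E}_{\mathbf z}(f_{T-k}) - \mathcal{E}_{\mathbf z}(f_*) \big\}. \tag{4.11}
\end{align*}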
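The telescoping identity behind Step 1 is purely algebraic and holds for any step sizes and any sequence of empirical errors. The following sketch checks it numerically; `decomposition_rhs` is a hypothetical helper, and the random numbers standing in for $\mathcal{E}_{\mathbf z}(f_t)$ are illustrative data only.

```python
import random

def decomposition_rhs(eta, err):
    """Right-hand side of the decomposition of 2 * eta_T * E_z(f_T).

    eta[t - 1] and err[t - 1] play the roles of eta_t and E_z(f_t), t = 1, ..., T.
    """
    T = len(eta)
    average = sum(2.0 * eta[t - 1] * err[t - 1] for t in range(1, T + 1)) / T
    moving = 0.0       # terms 2 eta_t {E_z(f_t) - E_z(f_{T-k})}
    step_drift = 0.0   # terms {2 eta_t - 2 eta_{T-k}} E_z(f_{T-k})
    for k in range(1, T):
        ref = T - k
        weight = 1.0 / (k * (k + 1))
        for t in range(T - k + 1, T + 1):
            moving += weight * 2.0 * eta[t - 1] * (err[t - 1] - err[ref - 1])
            step_drift += weight * (2.0 * eta[t - 1] - 2.0 * eta[ref - 1]) * err[ref - 1]
    return average + moving + step_drift

random.seed(1)
T = 25
eta = [0.3 * t ** (-0.5) for t in range(1, T + 1)]   # eta_t = eta_1 * t^(-theta)
err = [random.uniform(0.0, 2.0) for _ in range(T)]   # stand-ins for E_z(f_t)
lhs = 2.0 * eta[-1] * err[-1]
rhs = decomposition_rhs(eta, err)
print("difference between the two sides:", abs(lhs - rhs))
```

Because no property of the iterates is used, the decomposition isolates exactly where the analysis of the last iterate needs extra work: the moving-average terms and the drift of the step sizes.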


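Step 2: Average error in the first term of (4.11). Choosing $f = f_*$ in (4.10) and taking summation
over $t = 1, \dots, T$ together with (4.2) and Lemma 4.2 yields

\[
\sum_{t=1}^{T} 2\eta_t \big\{ \mathcal{E}_{\mathbf z}(f_t) - \mathcal{E}_{\mathbf z}(f_*) \big\} \le \|f_*\|_K^2 + \sum_{t=1}^{T} \eta_t^2 G_t^2 \le \|f_*\|_K^2 + \eta_1^2 c_q^2 (\kappa+1)^{2q+2} \sum_{t=1}^{T} t^{q(1-\theta) - 2\theta}.
\]

Since $1 > \theta \ge \frac{q}{q+1}$ we find $-2 < q(1-\theta) - 2\theta \le 0$. Moreover, $q(1-\theta) - 2\theta < -1$ if and only if $\theta > \frac{q+1}{q+2}$.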


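The following bound for the first term of (4.11) then follows

\[
\frac{1}{T} \sum_{t=1}^{T} 2\eta_t \big\{ \mathcal{E}_{\mathbf z}(f_t) - \mathcal{E}_{\mathbf z}(f_*) \big\} \le
\begin{cases}
\Big( \|f_*\|_K^2 + C_{q,\kappa} \frac{(2+q)\theta - q}{(2+q)\theta - q - 1} \Big) T^{-1}, & \text{when } \theta > \frac{q+1}{q+2}, \\
\big( \|f_*\|_K^2 + 2 C_{q,\kappa} \big) (\log T)\, T^{-1}, & \text{when } \theta = \frac{q+1}{q+2}, \\
\Big( \|f_*\|_K^2 + \frac{C_{q,\kappa}}{q + 1 - (2+q)\theta} \Big) T^{q - (2+q)\theta}, & \text{when } \theta < \frac{q+1}{q+2},
\end{cases}
\]

where $C_{q,\kappa}$ is the constant given by

\[
C_{q,\kappa} = \eta_1^2 c_q^2 (\kappa + 1)^{2q+2}.
\]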
Step 3: Moving average error in the second term of (4.11). Let $k \in \{1, \dots, T-1\}$. Choosing $f = f_{T-k}$
in (4.10) and taking summation over $t = T-k+1, \dots, T$ yields


^ 2pt{£.ift)-£.ifT-k)}<\\fT-k+l-fT-k\\l+ Y. 


t=T-fc+l 

By Corollary 4.3, 




WfT-k+l - fT-kWl < + l)2(9+l)(r _ 

This bound is the term with t = T—k-\-loi the following estimate which is a consequence of Lemma 4.2 

T T 




-e)- 2 e 


t=T-k-\-l 


t^T-k-\-l 


Hence 


^ 2i^t{£.{ft)-£.{fT-k)}<Cq,, 

t=T-fe+l 


^q{l-e )-20 _ ^Y{l-9)-2e 

.t=T-k+l 


Denote $q^* = 2\theta - q(1-\theta)$. We know that $0 \le q^* < 2$ and $q^* = 1$ when $\theta = \frac{q+1}{q+2}$. So


^q{l-S)-29 < 

t=T-k+l 


X ^ dx < 


'T-k 


1 — q ’ ' g +2 ’ 

log 7 ^, when 0=2^. 


When $\theta < \frac{q+1}{q+2}$, we have $q^* < 1$ and for $k \le \frac{T}{2}$, we see from the mean value theorem that


1 - g* 1 - g* “ 1 - g* 

which is exactly (T — k)~‘^ k and bounded by 2'? T~^ k. It follows that 


T-l 


'^k(k + l) L] 277t{£lz(/t) £lz(/T-fe)} 

fc=l ^ ' t=T-k+l 

1 


<2C',,« y] 


fe<T/2 


fc(fc + l) 


2^ r-”? fc + 2C', 


^ 1 Ti--?* 

fc(fc + 1) 1 — g* 

T-l>fc>T/2 ^ ^ ^ 


< 2 a . 2 « + 


1 -g* 


(logT)T«li-®l 




When $\theta = \frac{q+1}{q+2}$, we see from the mean value theorem that


log 


T 


T-k 


= -log(l--) < 


k 1 


V TJ - Tl - ^ T-k 


It follows that 


T-l 


'^k(k + l) L/ {£z.{ft) fz(/T-fe)} 

fe=l ^ ^ t=T-k+l 

T-l T-l 

- X! Tt^TTuI “ X! 


{T-k)k '^’^T ^ \ k T-k 

k=l ' ^ fc=l ^ 


1 1 


< 4a,« 


logT 

T 
When $\theta > \frac{q+1}{q+2}$, we have $q^* > 1$ and for $k \le \frac{T}{2}$,


Then 


1 - g* 


AU-?* _ 1 

■5A- < 2 ? T-« fc. 

g* - 1 


T-l 


^ fcffc + T) ^ 277t{fz(/t) fz(/T-fc)} 

c=l ^ t=T-fc+l 


T-l 


< 2 « +^C, 


^ ^ logT 


< 2 «*+^C', 


1 


fc=l 


^-l 


* 1 ' 
g* — 1 

Thus the second term of (4.11) can also be bounded as 


T-l 


E 


1 


k(k + l) E ‘^Vt{£z{ft) Szifr-k)} 

k=l ^ ' t=T-k+l 


g*-l 




when 9 > 
when 9=^, 


<{ 4C',,,(logr)T-\ 

2 C,,, (29* + (logT)T9-(2+9)«, when 0 

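Step 4: Error concerning $\mathcal{E}_{\mathbf z}(f_*)$ in the third term of (4.11). Let $k \in \{1, \dots, T\}$. We have

\[
\sum_{t=T-k+1}^{T} 2\eta_t \le 2\eta_1 \int_{T-k}^{T} x^{-\theta}\, dx = 2\eta_1 \frac{T^{1-\theta} - (T-k)^{1-\theta}}{1-\theta}.
\]

Putting this estimate in the coefficient of the third term of (4.11), we find

\begin{align*}
& \frac{1}{T} \sum_{t=1}^{T} 2\eta_t - 2\eta_T + \sum_{k=1}^{T-1} \frac{1}{k+1} \Big[ \frac{1}{k} \sum_{t=T-k+1}^{T} 2\eta_t - 2\eta_{T-k} \Big] \\
&\le \frac{2\eta_1 T^{-\theta}}{1-\theta} - 2\eta_1 T^{-\theta} + \sum_{k=1}^{T-1} \frac{1}{k+1} \Big[ 2\eta_1 \frac{T^{1-\theta} - (T-k)^{1-\theta}}{k(1-\theta)} - 2\eta_1 (T-k)^{-\theta} \Big] \\
&= \frac{2\eta_1 \theta}{1-\theta} T^{-\theta} + \frac{2\eta_1}{1-\theta} \sum_{k=1}^{T-1} \frac{T^{1-\theta} - (T-k)^{1-\theta} - k(1-\theta)(T-k)^{-\theta}}{k(k+1)} \\
&= \frac{2\eta_1 \theta}{1-\theta} T^{-\theta} - \frac{2\eta_1}{1-\theta} T^{1-\theta} \sum_{k=1}^{T-1} \frac{1}{k(k+1)}\, g\Big(\frac{k}{T}\Big),
\end{align*}

where $g : [0,1) \to \mathbb{R}$ is the function defined by

\[
g(u) = -1 + (1-u)^{1-\theta} + (1-\theta) u (1-u)^{-\theta}, \qquad u \in [0,1).
\]

A simple computation gives its derivative

\[
g'(u) = \theta (1-\theta) u (1-u)^{-\theta-1}.
\]

So $g$ is an increasing function and is positive on $(0,1)$ by noting $g(0) = 0$. Observe that

\[
\frac{T}{k(k+1)} = \frac{T}{k} - \frac{T}{k+1} = \int_{k/T}^{(k+1)/T} u^{-2}\, du = \int_{(k-1)/T}^{k/T} \Big(u + \frac{1}{T}\Big)^{-2} du
\]

and $g\big(\frac{k}{T}\big) \ge g(u)$ for $u \in ((k-1)/T, k/T)$. Hence

\begin{align*}
\sum_{k=1}^{T-1} \frac{T}{k(k+1)}\, g\Big(\frac{k}{T}\Big) &\ge \sum_{k=1}^{T-1} \int_{(k-1)/T}^{k/T} \Big(u + \frac{1}{T}\Big)^{-2} g(u)\, du = \int_{0}^{(T-1)/T} \Big(u + \frac{1}{T}\Big)^{-2} g(u)\, du \\
&= \Big[ -\Big(u + \frac{1}{T}\Big)^{-1} g(u) \Big]_{0}^{(T-1)/T} + \int_{0}^{(T-1)/T} \Big(u + \frac{1}{T}\Big)^{-1} g'(u)\, du \\
&= -g\Big(\frac{T-1}{T}\Big) + \theta(1-\theta) \int_{0}^{(T-1)/T} \frac{u (1-u)^{-\theta-1}}{u + \frac{1}{T}}\, du.
\end{align*}

By the definition of the function $g$, we see

\[
-g\Big(\frac{T-1}{T}\Big) = 1 - T^{\theta-1} - (1-\theta)\big(T^{\theta} - T^{\theta-1}\big) = 1 - \theta T^{\theta-1} - (1-\theta) T^{\theta}.
\]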


Writing 


1 1 1 
= 1 - --^ = 1 - 


T u + ^ 


T 1 + ^ — [1 — u) 


= 1 - 


1 


T + 1 


1 - 


1 — M 
1 + y 


we use the Taylor expansion for the integral and find 


r-(T-l)/T 


t(l — u) 


- 1-0 


U + y 


-du = 


AT-1)/T 

/ (l-«) 

^0 

-1 1 


- 1-6 


1 - 


E 


r + 1 ^ vr + 1 

fe =0 ^ 



> 


^ 'Y'O _ 


r + 1 9 


1 1 ^ / T y 1 

“~T + i^lr + i/ k-e' 


du 


We notice that > 2 for any T >2 (with limit e), which implies 

k 


=_ 

T+lJ (1 + T) 


T< 2 ^ 


V(^ - i)r +1 < fc < £T, £ e N. 


It follows that 


^ I T+ 1 / k-9-^ ^ k-e 

^ ^ e=i k={e-i)T+i 


k=l 


1 ^ T-0 £T-9 

< ^^ + log -TT + E 2 log 77^ 

i=2 


< 


1-1 

1 


1-1 


{£ -i)T-e 


1 


1 _ 0 + ITTg + 21og4 + logT. 


Therefore, we have 

T-l 


E 

fc=i 


^ - - -'—e-i n a\rrS t a\ ^ 


-90--9) +log +21og4 


>0+^-T-i-(2 + log4) 


1 + logT 
T + 1 
1 + log T 
T + 1 ■ 


(4.12) 
This tells us that the third term of (4.11) can be estimated as


T T—1 r T 

+ I I] 2m-2vT- 

t=l fe=l . t=T-k+l 


•£z{f* 


^ — g / ^0 — 2 

-1-9 


+ (2 + log4) £z(/*) < 2,7iT-^c',T^-i£-.(/,). 


Putting all the above estimates for the first three terms into (4.11), we see that the desired bound 
(4.8) holds true with the constant Ci given explicitly by 

^ 'llkq(K-kk) (2+g)e_q_l i 

Cl = <^ 6 ? 7 ic 2 (k + 1 ) 29 + 2 , 

^ mclin + 1)29+2 (2(2+9)«-9 + , when 9 < 

The proof of Lemma 4.4 is complete. 


when 9 > 
when 9 = !^, 


Lemma 4.4 is useful and can also be used in stochastic convex optimization problems other than learning.
In what follows, we shall see how it can be used in our specific learning problems. For notational
simplicity, with $R > 0$ we denote

\[
\mathcal{M}_{\mathbf z}(R) = \sup_{f \in B_R} \max\big\{ \mathcal{E}_{\mathbf z}(f_\rho^V) - \mathcal{E}_{\mathbf z}(f),\ 0 \big\}. \tag{4.13}
\]

Proposition 4.5. Under the assumptions of Lemma 4-4, we have 

Szifr) - £z{U) < (t^) + [c'sKt + + A{U)) 

_|_ 11 / (4.14) 
2771 

where C 2 is the constant given by C 2 = c'g (|F|o + Cq{l + \\f'^ 11^)11/;^Iloo) + Ci. 

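Proof. Note that by Lemma 4.4, we have (4.8). The first term in the bound (4.8) involves the empirical
error $\mathcal{E}_{\mathbf z}(f_*)$ which can be estimated as

\begin{align*}
\mathcal{E}_{\mathbf z}(f_*) &= \big\{ \big(\mathcal{E}_{\mathbf z}(f_*) - \mathcal{E}_{\mathbf z}(f_\rho^V)\big) - \big(\mathcal{E}(f_*) - \mathcal{E}(f_\rho^V)\big) \big\} + \big(\mathcal{E}(f_*) - \mathcal{E}(f_\rho^V)\big) + \mathcal{E}_{\mathbf z}(f_\rho^V) \\
&\le \mathcal{F}_{\mathbf z}(f_*) + \mathcal{A}(f_*) + \mathcal{E}_{\mathbf z}(f_\rho^V).
\end{align*}

Also, condition (3.1) implies

\[
|V(y, f_\rho^V(x))| \le |V|_0 + c_q \big(1 + |f_\rho^V(x)|^q\big) |f_\rho^V(x)| \le |V|_0 + c_q \big(1 + \|f_\rho^V\|_\infty^q\big) \|f_\rho^V\|_\infty.
\]

Hence,

\[
\mathcal{E}_{\mathbf z}(f_\rho^V) \le |V|_0 + c_q \big(1 + \|f_\rho^V\|_\infty^q\big) \|f_\rho^V\|_\infty.
\]

With these, we can bound the first term of (4.8) as

\[
\Big( \frac{\|f_*\|_K^2}{2\eta_1} + c'_\theta \mathcal{E}_{\mathbf z}(f_*) + C_1 \Big) \Lambda_T \le c'_\theta \big( \mathcal{F}_{\mathbf z}(f_*) + \mathcal{A}(f_*) \big) \Lambda_T + \Big( \frac{\|f_*\|_K^2}{2\eta_1} + C_2 \Big) \Lambda_T.
\]

What remains is to estimate the second term of (4.8), denoted as

\[
J_{T,\mathbf z} = \frac{1}{2\eta_T} \sum_{k=1}^{T-1} \frac{1}{k+1} \Big[ 2\eta_{T-k} - \frac{1}{k} \sum_{t=T-k+1}^{T} 2\eta_t \Big] \big\{ \mathcal{E}_{\mathbf z}(f_*) - \mathcal{E}_{\mathbf z}(f_{T-k}) \big\}.
\]

Denote $R = T^{\frac{1-\theta}{2}}$. Lemma 4.2 tells us that $f_k \in B_R$ for each $k = 1, \dots, T$. It follows that for
$k = 1, \dots, T-1$,

\begin{align*}
\mathcal{E}_{\mathbf z}(f_*) - \mathcal{E}_{\mathbf z}(f_{T-k}) &= \big\{ \big(\mathcal{E}_{\mathbf z}(f_*) - \mathcal{E}_{\mathbf z}(f_\rho^V)\big) - \big(\mathcal{E}(f_*) - \mathcal{E}(f_\rho^V)\big) \big\} \\
&\quad + \big(\mathcal{E}(f_*) - \mathcal{E}(f_\rho^V)\big) + \mathcal{E}_{\mathbf z}(f_\rho^V) - \mathcal{E}_{\mathbf z}(f_{T-k}) \\
&\le \mathcal{F}_{\mathbf z}(f_*) + \mathcal{A}(f_*) + \mathcal{M}_{\mathbf z}(R).
\end{align*}

By the choice of the step sizes, $2\eta_{T-k} - \frac{1}{k} \sum_{t=T-k+1}^{T} 2\eta_t \ge 0$ for any $k \in \{1, \dots, T-1\}$. Therefore, $J_{T,\mathbf z}$
can be bounded by

\[
\frac{1}{2\eta_T} \sum_{k=1}^{T-1} \frac{1}{k+1} \Big[ 2\eta_{T-k} - \frac{1}{k} \sum_{t=T-k+1}^{T} 2\eta_t \Big] \big[ \mathcal{F}_{\mathbf z}(f_*) + \mathcal{A}(f_*) + \mathcal{M}_{\mathbf z}(R) \big].
\]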
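Now we need to bound the above summation. Note that, for each $k$,

\[
2\eta_{T-k} - \frac{1}{k} \sum_{t=T-k+1}^{T} 2\eta_t = \frac{2\eta_1}{k} \sum_{t=T-k+1}^{T} \big[ (T-k)^{-\theta} - t^{-\theta} \big].
\]

Applying the mean value theorem to the function $g(x) = -x^{-\theta}$ on $[T-k, t]$ with $t \in \{T-k+1, \dots, T\}$,
we find that for some $c \in (T-k, t)$,

\[
(T-k)^{-\theta} - t^{-\theta} = g(t) - g(T-k) = (t - (T-k))\, g'(c) \le (t - (T-k))\, \theta (T-k)^{-\theta-1}.
\]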


Hence 


T-l 


fc=l 


Stir S 


t^T-k+l 


<2gi0 Y, 


(T-k) 


- 0-1 


k{k + 1 ) 

k<T/2 '• ' t=T-fc+l 


E (^ “ ^ + ^) + E 


k>T/2 


k + 1 


2f]T-k 


<27710 ^ 


(T- fc(fc + 1) 

fc(fc + 1) 2 


E ^^9T-k 


k<T/2 ^ ' k>T/2 

4771 


6771 rr.-0 


<7710 ^ ^ (r-7c)-®<^T 

k<T/2 k>T/2 


Thus 

Jt.z < {J-z(/,) + AiU) + Mz(i?)}. 

Then the desired bound follows from Lemma 4.4. 


□ 
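The two facts about the averaging weights used in the proof above, nonnegativity and the mean value bound, can be checked numerically. The sketch below assumes $\eta_t = \eta_1 t^{-\theta}$ and uses $\sum_{t=T-k+1}^{T} (t - (T-k)) = k(k+1)/2$, which turns the mean value estimate into the bound $\eta_1 \theta (k+1)(T-k)^{-\theta-1}$; the function names and the numerical values are hypothetical.

```python
def step_size(t, eta1, theta):
    """Polynomially decaying step size eta_t = eta1 * t^(-theta)."""
    return eta1 * t ** (-theta)

def averaging_weight(T, k, eta1, theta):
    """2 * eta_{T-k} - (1/k) * sum_{t=T-k+1}^{T} 2 * eta_t."""
    tail = sum(2.0 * step_size(t, eta1, theta) for t in range(T - k + 1, T + 1))
    return 2.0 * step_size(T - k, eta1, theta) - tail / k

def mean_value_bound(T, k, eta1, theta):
    """Bound eta1 * theta * (k + 1) * (T - k)^(-theta - 1) from the mean value theorem."""
    return eta1 * theta * (k + 1) * (T - k) ** (-theta - 1.0)

T, eta1, theta = 50, 0.1, 0.5
for k in (1, 10, 25, 49):
    print(k, averaging_weight(T, k, eta1, theta), mean_value_bound(T, k, eta1, theta))
```

The bound is tight for small $k$ and becomes loose as $k$ approaches $T$, which is consistent with the proof treating the ranges $k \le T/2$ and $k > T/2$ separately.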


4.3 Computational Errors for Weighted Average and Best Iterate 


Lemma 4.6. Under the assumptions of Lemma 4.4, let $g_T = a_T$ (or $g_T = b_T$). Then

£zibx) - £zif.) < (^^^^ + Ci)at, 

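where $\widetilde{\Lambda}_T$ is given by

\[
\widetilde{\Lambda}_T =
\begin{cases}
T^{-(1-\theta)}, & \text{when } \theta > \frac{q+1}{q+2}, \\
(\log T)\, T^{-(1-\theta)}, & \text{when } \theta = \frac{q+1}{q+2}, \\
T^{-(\theta(1+q)-q)}, & \text{when } \theta < \frac{q+1}{q+2},
\end{cases}
\tag{4.15}
\]

and $\widetilde{C}_1$ is a constant depending on $q, \kappa, \theta$ (independent of $T$, $m$ or $f_*$ and given explicitly in the proof).
Note that there is a subtle difference between $\widetilde{\Lambda}_T$ and $\Lambda_T$ defined by (4.9), where the latter term has
an extra $\log T$ for $\theta < \frac{q+1}{q+2}$.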

Proof. For any $u \in \mathbb{R}$, we have

-u)> p- 

t=i \t=i / ‘ *’■■■’ \t=i / 

and by convexity of £z, 

/ T \ T 1 ^ 

£^{aT) = £z < '^uJt£^{fT) = —- '^Vt£z{ft)- 

\t=i / t=i t=i 


Therefore, we have


and 


£^{bT)-u< - ^r]ti£^{ft) - u) 

Z^t=i Vt t=i 


£z.{aT)-u< - ^r]t{£^ift) - u). 

Z2t=i Vt t=i 


We thus get 

1 ^ 

£.{gT) - £.{n < -T-E ^t{£.{ft) - £M*))- 

Z^t=i 0* t=i 

Following Step 2 of the proof of Lemma 4.4, we have 


T 

E 20 t {£.{ft) - £.{f*)} 

{ (ll/*lll: + P.K (2+t)e-g-i) ’ when0>^, 

< (IIMIk + 2P„) (logT), when 9 = gi, 

i (lIMIIc + when 6 < 


(4.16) 


Introducing the above inequality into (4.16), and using 0* — 0i /tE P > r]iT^ ®/2, we get 
our desired result with Ci given by 


_ f 277ic2(«; + l)29+2^g^^^ when0>^, 
Ci = < 4 ? 7 iCg(/v + 1 )^'J+^, when 0 =^, 

[ 277ic2(/c +1)29+2When0<^. 


□ 

While the above proof is shorter and easier than the proof of Lemma 4.4, it is surprising that the 
computational error bounds for the last iterate and the average (or the best one) are roughly of the same 
order. 

4.4 Iterate Bound and Computational Error for Smooth Loss Functions 

The following result can be proved by using the fact that V'{y, ■) is Lipschitz. 
Lemma 4.7. Let $0 < \eta_t \le (L\kappa^2)^{-1}$ for all $t \in \mathbb{N}$. Assume that $V(y, \cdot)$ is differentiable and $V'(y, \cdot)$ is
Lipschitz continuous with constant $L > 0$. Then we have

gz(/r)-gz(/.)< ■ 

/2k=l 

In particular, if rjt = r]it~^ with 9 G [0,1) satisfying rji < then 

fz(/T)-fz(/*) < 

Vi 

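Proof. Since $V'(y, \cdot)$ is Lipschitz with constant $L$ for any $y \in Y$, we have for any $a, b \in \mathbb{R}$,

\[
V(y, b) \le V(y, a) + V'(y, a)(b - a) + \frac{L}{2}(b - a)^2.
\]

Choosing $y = y_j$, $b = f_{t+1}(x_j)$ and $a = f_t(x_j)$, according to the reproducing kernel property (4.3) and
(4.4), we get for $j = 1, \dots, m$ and $t \in \mathbb{N}$,

\[
V(y_j, f_{t+1}(x_j)) \le V(y_j, f_t(x_j)) + V'(y_j, f_t(x_j)) \langle f_{t+1} - f_t, K_{x_j} \rangle_K + \frac{L\kappa^2}{2} \|f_{t+1} - f_t\|_K^2.
\]

Summing up over $j = 1, \dots, m$, with $G_t = \frac{1}{m} \sum_{j=1}^m V'(y_j, f_t(x_j)) K_{x_j}$, we get

\[
\mathcal{E}_{\mathbf z}(f_{t+1}) \le \mathcal{E}_{\mathbf z}(f_t) + \langle f_{t+1} - f_t, G_t \rangle_K + \frac{L\kappa^2}{2} \|f_{t+1} - f_t\|_K^2.
\]

Introducing (2.1), and noting that $\eta_t \le (L\kappa^2)^{-1}$, we get

\[
\mathcal{E}_{\mathbf z}(f_{t+1}) \le \mathcal{E}_{\mathbf z}(f_t) - \frac{\eta_t}{2} \|G_t\|_K^2. \tag{4.17}
\]

By the convexity of $V(y, \cdot)$, it is easy to prove that

\[
\mathcal{E}_{\mathbf z}(f_t) \le \mathcal{E}_{\mathbf z}(f_*) + \langle f_t - f_*, G_t \rangle_K.
\]

Introducing this inequality into (4.17), we get

\begin{align*}
\mathcal{E}_{\mathbf z}(f_{t+1}) &\le \mathcal{E}_{\mathbf z}(f_*) + \frac{1}{2\eta_t} \big( 2\eta_t \langle f_t - f_*, G_t \rangle_K - \eta_t^2 \|G_t\|_K^2 \big) \\
&= \mathcal{E}_{\mathbf z}(f_*) + \frac{1}{2\eta_t} \big( \|f_t - f_*\|_K^2 - \|f_t - f_* - \eta_t G_t\|_K^2 \big) \\
&= \mathcal{E}_{\mathbf z}(f_*) + \frac{1}{2\eta_t} \big( \|f_t - f_*\|_K^2 - \|f_{t+1} - f_*\|_K^2 \big),
\end{align*}

so that

\[
2\eta_t \big( \mathcal{E}_{\mathbf z}(f_{t+1}) - \mathcal{E}_{\mathbf z}(f_*) \big) \le \|f_t - f_*\|_K^2 - \|f_{t+1} - f_*\|_K^2. \tag{4.18}
\]

Summing up over $t = 1, \dots, T$, with $f_1 = 0$, we have

\[
\sum_{t=1}^{T} 2\eta_t \big( \mathcal{E}_{\mathbf z}(f_{t+1}) - \mathcal{E}_{\mathbf z}(f_*) \big) \le \|f_1 - f_*\|_K^2 - \|f_{T+1} - f_*\|_K^2 \le \|f_*\|_K^2.
\]

By (4.17), we have $\mathcal{E}_{\mathbf z}(f_{t+1}) \le \mathcal{E}_{\mathbf z}(f_t)$ for all $t \le T$. It thus follows that

\[
\sum_{t=1}^{T} 2\eta_t \big( \mathcal{E}_{\mathbf z}(f_{T+1}) - \mathcal{E}_{\mathbf z}(f_*) \big) \le \sum_{t=1}^{T} 2\eta_t \big( \mathcal{E}_{\mathbf z}(f_{t+1}) - \mathcal{E}_{\mathbf z}(f_*) \big) \le \|f_*\|_K^2,
\]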

which leads to the first argument of the lemma. The rest of the proof can be finished by noting that 


T 


> 


fT+l 


Vl 


u ^du > Pi 


pi 


t=i 


□ 
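The descent property (4.17) used in the proof above can be observed on a toy problem. The sketch assumes the logistic loss $V(y, a) = \log(1 + e^{-ya})$, whose derivative is Lipschitz with $L = 1/4$, and a Gaussian kernel with $\kappa = 1$, so that any $\eta_t \le 4$ is admissible; these choices and all names below are illustrative assumptions, not the paper's experiments.

```python
import math
import random

def gaussian_kernel(x, xp, sigma=1.0):
    """Gaussian kernel on the real line; K(x, x) = 1, so kappa = 1."""
    return math.exp(-((x - xp) ** 2) / (2.0 * sigma ** 2))

def logistic_loss(y, a):
    return math.log1p(math.exp(-y * a))

def logistic_derivative(y, a):
    """Derivative in `a` of log(1 + exp(-y * a)); it is Lipschitz with L = 1/4."""
    return -y / (1.0 + math.exp(y * a))

def empirical_risk(gram, alpha, ys):
    m = len(ys)
    values = [sum(gram[i][j] * alpha[j] for j in range(m)) for i in range(m)]
    return sum(logistic_loss(ys[i], values[i]) for i in range(m)) / m

def run_gradient_descent(xs, ys, T, eta1, theta):
    """Gradient descent f_{t+1} = f_t - eta_t * G_t with f_1 = 0.

    Returns the risks E_z(f_1), ..., E_z(f_{T+1}) and the guaranteed decreases
    (eta_t / 2) * ||G_t||_K^2 for t = 1, ..., T.
    """
    m = len(xs)
    gram = [[gaussian_kernel(a, b) for b in xs] for a in xs]
    alpha = [0.0] * m
    risks, decreases = [], []
    for t in range(1, T + 1):
        eta_t = eta1 * t ** (-theta)
        values = [sum(gram[i][j] * alpha[j] for j in range(m)) for i in range(m)]
        risks.append(sum(logistic_loss(ys[i], values[i]) for i in range(m)) / m)
        # G_t = sum_j coef[j] * K_{x_j}, and ||G_t||_K^2 = coef^T * gram * coef.
        coef = [logistic_derivative(ys[j], values[j]) / m for j in range(m)]
        k_coef = [sum(gram[i][j] * coef[j] for j in range(m)) for i in range(m)]
        decreases.append(0.5 * eta_t * sum(coef[i] * k_coef[i] for i in range(m)))
        alpha = [alpha[j] - eta_t * coef[j] for j in range(m)]
    risks.append(empirical_risk(gram, alpha, ys))
    return risks, decreases

random.seed(2)
xs = [random.uniform(-2.0, 2.0) for _ in range(30)]
ys = [1.0 if x + random.gauss(0.0, 0.5) > 0 else -1.0 for x in xs]
risks, decreases = run_gradient_descent(xs, ys, T=100, eta1=2.0, theta=0.25)
print("first and last empirical risk:", risks[0], risks[-1])
```

Since $f_1 = 0$, the first risk equals $\log 2$, and every later risk is smaller by at least the recorded decrease.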
Using the above lemma, we can bound the iterates as follows.

Lemma 4.8. Under the assumptions of Lemma 4.7, we have for $t = 1, \dots, T$,

\[
\|f_{t+1}\|_K \le \sqrt{2 |V|_0 \sum_{k=1}^{t} \eta_k}.
\]

In particular, if rjt = pit ® with 0 € [0,1) satisfying pi < then 

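\[
\|f_{t+1}\|_K \le t^{\frac{1-\theta}{2}}.
\]

Proof. Choosing $f_* = 0$ in (4.18), we get for $k = 1, \dots, t$,

\[
\|f_{k+1}\|_K^2 \le \|f_k\|_K^2 + 2\eta_k \big( \mathcal{E}_{\mathbf z}(0) - \mathcal{E}_{\mathbf z}(f_{k+1}) \big) \le \|f_k\|_K^2 + 2\eta_k |V|_0.
\]

Applying this relationship iteratively for $k = t, \dots, 1$, with $f_1 = 0$, we get

\[
\|f_{t+1}\|_K^2 \le 2 |V|_0 \sum_{k=1}^{t} \eta_k,
\]

which leads to the first conclusion. The second inequality can be proved by noting that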




/c=l 


^ = ?71 ^ ® ( 1 + ) < 


m 


a-0 


1-1 


□ 


5 Sample Error and Finite Sample Bounds 

In this section, we will estimate sample errors and then prove our main results.


5.1 Sample Errors 

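We first bound the sample error $\mathcal{F}_{\mathbf z}(f_*)$ for some fixed $f_* \in \mathcal{H}_K$ as follows. This is done by applying
the Bernstein inequality.

Lemma 5.1. Assume conditions (3.1) and (3.2) hold. For any $f_* \in \mathcal{H}_K$ with $\|f_*\|_K \le R$, where $R \ge 1$,
with confidence at least $1 - \frac{\delta}{2}$,

\[
\mathcal{F}_{\mathbf z}(f_*) \le \big(C'_1 + 2\sqrt{c_\tau}\big) \log\frac{2}{\delta} \max\Big\{ \frac{R^{q+1}}{m},\ \Big(\frac{R^{2+q-\tau}}{m}\Big)^{\frac{1}{2-\tau}},\ \mathcal{A}(f_*) \Big\}, \tag{5.1}
\]

where $C'_1$ is a constant independent of $T, m, \delta$, given explicitly in the proof.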


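Proof. We apply the Bernstein inequality, which asserts that, for a random variable $\xi$ bounded by $M > 0$ and
for any $\epsilon > 0$,

\[
\mathrm{Prob}\Big\{ \frac{1}{m} \sum_{i=1}^{m} \xi(z_i) - \mathbb{E}(\xi) > \epsilon \Big\} \le \exp\Big\{ -\frac{m \epsilon^2}{2\big(\sigma^2(\xi) + \frac{1}{3} M \epsilon\big)} \Big\}.
\]

Here the random variable $\xi$ on $Z$ is given by $\xi(x, y) = V(y, f_*(x)) - V(y, f_\rho^V(x))$. The increment condition
(3.1) implies that $\xi$ is bounded by $M := C'_1 R^{q+1}$, where $C'_1$ is the constant given by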


C'l := c. 


.9+1 


11 / 


ll/p^llV') 











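By condition (3.2), its variance $\sigma^2(\xi)$ is bounded by

\[
c_\tau R^{2+q-\tau} \big( \mathcal{E}(f_*) - \mathcal{E}(f_\rho^V) \big)^{\tau} = c_\tau R^{2+q-\tau} \big( \mathcal{A}(f_*) \big)^{\tau}.
\]

Solving the quadratic equation from the Bernstein inequality, we see that with confidence at least $1 - \frac{\delta}{2}$
there holds

\[
\mathcal{F}_{\mathbf z}(f_*) \le \frac{2M \log\frac{2}{\delta}}{3m} + \sqrt{\frac{2 \sigma^2(\xi) \log\frac{2}{\delta}}{m}} \le \big(C'_1 + 2\sqrt{c_\tau}\big) \log\frac{2}{\delta} \max\Big\{ \frac{R^{q+1}}{m},\ \sqrt{\frac{R^{2+q-\tau} (\mathcal{A}(f_*))^{\tau}}{m}} \Big\}.
\]

Applying an elementary inequality

\[
x^{\tau} y^{1-\tau} \le \tau x + (1-\tau) y, \qquad \forall \tau \in [0,1],\ x, y \ge 0, \tag{5.2}
\]

yields

\[
\sqrt{\frac{R^{2+q-\tau} (\mathcal{A}(f_*))^{\tau}}{m}} = \Big[ \Big(\frac{R^{2+q-\tau}}{m}\Big)^{\frac{1}{2-\tau}} \Big]^{1 - \frac{\tau}{2}} \big(\mathcal{A}(f_*)\big)^{\frac{\tau}{2}} \le \max\Big\{ \Big(\frac{R^{2+q-\tau}}{m}\Big)^{\frac{1}{2-\tau}},\ \mathcal{A}(f_*) \Big\}.
\]

Then the desired result follows. □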
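The step "solving the quadratic equation from the Bernstein inequality" can be made concrete. The sketch below assumes the one-sided Bernstein tail $\exp\{-m\epsilon^2 / (2(\sigma^2 + M\epsilon/3))\}$, sets it equal to $\delta/2$, and solves for $\epsilon$; the function names and the numerical values are hypothetical.

```python
import math

def bernstein_tail(m, sigma2, M, eps):
    """Bernstein tail bound exp(-m * eps^2 / (2 * (sigma2 + M * eps / 3)))."""
    return math.exp(-m * eps ** 2 / (2.0 * (sigma2 + M * eps / 3.0)))

def bernstein_deviation(m, sigma2, M, delta):
    """Positive root eps of m * eps^2 = 2 * log(2/delta) * (sigma2 + M * eps / 3)."""
    log_term = math.log(2.0 / delta)
    b = M * log_term / (3.0 * m)
    return b + math.sqrt(b * b + 2.0 * sigma2 * log_term / m)

def simple_bound(m, sigma2, M, delta):
    """Relaxation 2 * M * log(2/delta) / (3 * m) + sqrt(2 * sigma2 * log(2/delta) / m)."""
    log_term = math.log(2.0 / delta)
    return 2.0 * M * log_term / (3.0 * m) + math.sqrt(2.0 * sigma2 * log_term / m)

m, sigma2, M, delta = 1000, 0.25, 1.0, 0.05
eps = bernstein_deviation(m, sigma2, M, delta)
print("deviation:", eps, "relaxed bound:", simple_bound(m, sigma2, M, delta))
```

The relaxation uses $\sqrt{a^2 + b} \le a + \sqrt{b}$ and is the form that appears in the proof: a term of order $M/m$ plus a term of order $\sqrt{\sigma^2/m}$.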


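We next bound the empirical process over a ball $B_R$ for some $R > 0$. To do this, we need the following
concentration inequality. Its proof is similar to that of Proposition 6 in [40], together with [33,
Theorem 3.5] and [34, Exercise 6.8]. We omit the proof.

Lemma 5.2. Let $\mathcal{G}$ be a set of measurable functions on $Z$, and $B, c > 0$, $\tau \in [0,1]$ be constants such that
each function $f \in \mathcal{G}$ satisfies $\|f\|_\infty \le B$ and $\mathbb{E}(f^2) \le c (\mathbb{E} f)^{\tau}$. If for some $a \ge B^{\zeta}$ and $\zeta \in (0, 2)$,

\[
\mathbb{E}_{\mathbf z}\big[\log \mathcal{N}(\mathcal{G}, \epsilon, d_{2,\mathbf z})\big] \le a \epsilon^{-\zeta}, \qquad \forall \epsilon > 0,
\]

then there exists a constant $c'_\zeta$ depending only on $\zeta$ such that for any $b > 0$, with probability at least
$1 - e^{-b}$, there holds

\[
\mathbb{E} f - \frac{1}{m} \sum_{i=1}^{m} f(z_i) \le \frac{1}{2} \eta^{1-\tau} (\mathbb{E} f)^{\tau} + c'_\zeta \eta + 2 \Big(\frac{c b}{m}\Big)^{\frac{1}{2-\tau}} + \frac{18 B b}{m}, \qquad \forall f \in \mathcal{G},
\]

where

\[
\eta := \max\Big\{ c^{\frac{2-\zeta}{4 - 2\tau + \zeta\tau}} \Big(\frac{a}{m}\Big)^{\frac{2}{4 - 2\tau + \zeta\tau}},\ B^{\frac{2-\zeta}{2+\zeta}} \Big(\frac{a}{m}\Big)^{\frac{2}{2+\zeta}} \Big\}.
\]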

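The following lemma is essentially contained in [40]. We report a short proof for the sake of completeness.

Lemma 5.3. Assume (3.1) with $q \ge 0$, (3.2) with $\tau \in [0,1]$, (3.5) with $\beta \in (0,1]$ and (3.6) with $\zeta \in (0,2)$.
Let $R \ge 1$. Then with confidence at least $1 - \frac{\delta}{2}$, there holds for every $g \in B_R$,

\begin{align*}
& \big(\mathcal{E}(g) - \mathcal{E}(f_\rho^V)\big) - \big(\mathcal{E}_{\mathbf z}(g) - \mathcal{E}_{\mathbf z}(f_\rho^V)\big) \\
&\le \frac{1}{2} \big(\mathcal{E}(g) - \mathcal{E}(f_\rho^V)\big) + C'_3 \log\frac{2}{\delta} \max\Big\{ \Big( \frac{R^{q(2+\zeta) + (4 - 2\tau + \zeta\tau)}}{m^2} \Big)^{\frac{1}{4 - 2\tau + \zeta\tau}},\ \frac{R^{q+1}}{m^{\frac{2}{2+\zeta}}},\ \Big(\frac{R^{2+q-\tau}}{m}\Big)^{\frac{1}{2-\tau}} \Big\} \tag{5.3}
\end{align*}

and

\[
\mathcal{M}_{\mathbf z}(R) \le C'_3 \log\frac{2}{\delta} \max\Big\{ \Big( \frac{R^{q(2+\zeta) + (4 - 2\tau + \zeta\tau)}}{m^2} \Big)^{\frac{1}{4 - 2\tau + \zeta\tau}},\ \frac{R^{q+1}}{m^{\frac{2}{2+\zeta}}},\ \Big(\frac{R^{2+q-\tau}}{m}\Big)^{\frac{1}{2-\tau}} \Big\}. \tag{5.4}
\]

Here $C'_3$ is a constant independent of $T, m, \delta$, given explicitly in the proof.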

Proof. We first apply Lemma 5.2 to the function set 

g = {f{x,y) = V{y,g{x)) -V{y,f^{x)) : g € . 

Condition (3.2) tells us that with c = CrRf'^^~'^^ each function f £ G satisfies E(/^) < c(E/)'^. Also, 
condition (3.1) implies that ||/||oo is bounded by M := C[R'^~^^. Notice from (3.1) that for /,/' € G, 

\f{x,y) - fix,y)\ = \V{y,g{x)) -V{y,g'{x))\ < Cg{l + K‘>)R‘>\g{x) - g'{x)\, 

there holds 


A/'(C/,e,(i2,z) < B^, 


Cq{l + K<i)Ri 


, d2,z I < Af ( i?i, 


Cq(l + K‘?)i?9+1 


Hence, condition (3.6) yields the covering number condition in Lemma 5.2 with a = c^c^(1 + k 9)Ci^(9+l)C. 
So we apply Lemma 5.2 and find that with confidence at least 1 — there holds for every / € t/. 




i=l 


where 


V = 


max < ^CrR^'^‘^ 


2-C 

+ Ct 


) 


~2-C /cc-C«(l + K«)?i?(9+l)^V 

V ^ ) 


< C 2 max • 


R 


g(2+;) + (4-2T + CT) \ 4-2T + iT 


i?«+l 


1 2 ( 5 


, 2^ 


where C 2 is the constant given by 

{cl-^clcfd + k«)2C)3=™ +C[^ {c^dqd + 

Apply the elementary inequality (5.2) which yields g^~'^{Efy < g + E/, and notice that E(/) = £(g) — 
^{fj) while A ^ f{zi) = £-z.{g) ~ ^z(/^)- We get that with confidence at least 1 — there holds for 
every g £ B^, we have 

{£{g)-£{fy))-{£dg)-£dfy)) 


+ C(-j g + 2 i^id) — ^(//T)) + 2 Y 

which leads to (5.3) with 


c^i?2+9-^x zAf , 2 ISMlogf > 


-)" " log 


C' = ( - + cM + 2c^— + 18M. 
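Now, introducing (5.3) into the equality

\[
\mathcal{E}_{\mathbf z}(f_\rho^V) - \mathcal{E}_{\mathbf z}(g) = \big\{ \big(\mathcal{E}(g) - \mathcal{E}(f_\rho^V)\big) - \big(\mathcal{E}_{\mathbf z}(g) - \mathcal{E}_{\mathbf z}(f_\rho^V)\big) \big\} - \big(\mathcal{E}(g) - \mathcal{E}(f_\rho^V)\big),
\]

with $\mathcal{E}(g) - \mathcal{E}(f_\rho^V) \ge 0$ and by recalling the definition of $\mathcal{M}_{\mathbf z}(R)$, we can derive (5.4). The proof is
completed. □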


5.2 Deriving the Finite Sample Bounds 

We have the following result, which will be used for the proof of Theorem 3.5. 

Proposition 5.4. Assume (3.1) with q > 0, (3.2) with r € [0,1], (3.5) with (3 € (0,1] and (3.6) with 
C S (0,2). Let T]t = rjit~^ with 0 < 6 < 1 satisfying 0 > and r]i satisfying (3.7). Let /* G TLk be 
such that ||/*||i<' < R, where R> 1. If 1 < R < T~^ and then with confidence 1 — S, 

we have 


. (l-i))(';(2 + C) + (4-2T + CT)) \ 4~2T4-r 

^(/t) -f(/p'') < Cglog-maxH -—- I 


R^At, Aif.) } . (5.5) 


where C 3 is a constant independent of T, m, 5, given explicitly in the proof. 

Proof. Recall Lemma 3.12. Let R = T^~. Introducing with (4.14), we have 

£(fT) -£{f^) < {(f(/T) - £{0) - {£Mt) -£.{f^))} + {r) 

+ (^CgAr + ^ (Az{f*) + A{f^,)) + + C 2 AT. 

Applying lemmas 5.1 and 5.3 with g = fx, with R G [1, R], we know that with confidence at least 1 — <5, 

' -- q(.2 + 0 + (4~2^ + C^) \ 4-2^ + iT J^q+l 


£{fT)-£ifp ) < c( log - max 


i?" 


’ 2 ’ 
m 2 +c 


~2+£_ 

^ .,2.1 1 


where is the constant given by 
. 4-0 , 


a = 


—, R^At, A{U) + - (£(/ t ) - £{0) , 


4-1 


+ (^c', + ^ j (1 + C] + 2^57)+ — +6*2. 


Since R‘^m < 1 and r G [0,1], C G (0, 2) one finds 


' ~<i(2 + C) + (4-2t + Ct) \ 42227+77 

R 2 \ m^+i 


i?9+l 




m 2 +c 


~(1-t)(2-C) 

4-2t + Ct 


> 1 , 


(5.6) 


and 


'' cr q(2+C) + (4-2T + CT) \ 4-2-r + CT 
it 2 \ m2 


~2+q- 




(2-t)(4-2t + Ct) 


> 1 . 


Subtracting 5 (£(/t) — £(/^)) from both sides of (5.6), and setting C 3 = 2CY we get the desired 


results. 


□ 
Now we are in a position to prove the explicit probabilistic upper bounds stated in Theorem 3.5.


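Proof of Theorem 3.5. We will use Proposition 5.4 with $f_* = f_\lambda$ to prove our result. Define a power
index $\theta^*$ as

\[
\theta^* =
\begin{cases}
1 - \theta, & \text{when } \theta \ge \frac{q+1}{q+2}, \\
\theta(1+q) - q, & \text{when } \theta < \frac{q+1}{q+2}.
\end{cases}
\tag{5.7}
\]

Comparing this with the definition (4.9) for $\Lambda_T$, we see that

\[
\Lambda_T =
\begin{cases}
T^{-\theta^*}, & \text{when } \theta > \frac{q+1}{q+2}, \\
T^{-\theta^*} \log T, & \text{when } \theta \le \frac{q+1}{q+2}.
\end{cases}
\]

From the definition of $\mathcal{D}(\lambda)$, we have

\[
\mathcal{A}(f_\lambda) \le \mathcal{D}(\lambda) \quad \text{and} \quad \lambda \|f_\lambda\|_K^2 \le \mathcal{D}(\lambda), \tag{5.8}
\]

which implies $\|f_\lambda\|_K \le \sqrt{\mathcal{D}(\lambda)/\lambda} =: R$. Balancing the orders of the last two terms of (5.5) by setting

\[
\lambda = \Lambda_T, \tag{5.9}
\]

we find that the last two terms of (5.5) can be bounded as

\[
\max\big\{ R^2 \Lambda_T,\ \mathcal{A}(f_\lambda) \big\} \le \mathcal{D}(\lambda) \le c_\beta \lambda^{\beta} \le
\begin{cases}
c_\beta T^{-\beta\theta^*}, & \text{when } \theta > \frac{q+1}{q+2}, \\
c_\beta T^{-\beta\theta^*} \log T, & \text{when } \theta \le \frac{q+1}{q+2}.
\end{cases}
\]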


Then we balance the above main part with the first term of (5.5) by setting 

/ (l-a)(9(2+;) + (4~2T + {T)) 4-2? + Ct 

_ rp-pe 


This leads us to choose T to be the integer part of 
\m ?"\, where 7 := 


(l^+/30') 


2 +/30) (4 — 2r + Cr) + 2 


g(2+c)(i-e) 


With this choice, the main part of (5.5) can be bounded as 

' (i-a)(9(2+;)+(4~2T+{T)) \ 4 = 2 !+^ 


max • 


< 


m 


, At, Al(/*) 


2 cpm 


-PSj 


when 0 > 

9+1 


(5.10) 


2 'ycpm ^^'''logrn, when 0 < 

Notice from the definition of 0, one can easily prove that 0 < 1 — 0. Then Rj..Jcp < x/Xfi^ < 

1—e 

AjX < T“ 2 “ and the restriction for R in Theorem 5.4 is satisfied up to constants. The restriction 

_q{l-0) 2 

T 2 m 2+9 < 1 is also satisfied because 

_ g(l —q(l —e)7 2 

T 2 < m 2 < . 

Observe that 7 < -j^. So by Theorem 5.4, with confidence 1 — (5, we have 

,. f 20/3(73771“^^^ log I, when 0 > 
Observe that the power index $\beta\theta^*\gamma$ is


/ /3(2-r+Cr/2)+{ ^--y-^^ +^il^^’ When0>«+^, 

= \ P U fl ^ 9+1 

[ /3(2-r+Cr/2)+^^T^{2^+±^+^ii^}’ 

while the index 7 can be expressed by (3.8). Then our desired learning rates are verified by setting 
the constant C = 2cpC^ when 0 > while C = when 6 < 2^. The proof of Theorem 3.5 

is complete. □


Proof of Theorem 3.7. We only sketch the proof for the case $g_T = a_T$. It is easy to prove the following
upper bound for $\|a_T\|_K$ by applying Lemma 4.2:

\[
\|a_T\|_K \le T^{\frac{1-\theta}{2}}.
\]

With the upper bound on $a_T$ and Lemma 4.6, by an argument similar to that for Theorem 3.5, one can prove
the results. We omit the details. □

Proof of Theorem 3.8. With Lemmas 4.7 and 4.8, and an approach similar to that for Theorem 3.5, we can
prove the convergence results for smooth loss functions. We omit the details. □

Proof of Theorem 3.10. We use Theorem 3.5 to prove the results. The hinge loss satisfies (3.1) with $q = 0$
and $c_q = |V|_0 = 1$ and $\|f_\rho^V\|_\infty = 1$ where $f_\rho^V$ is the Bayes rule $f_c$. Condition (3.2) is valid with $\tau = 0$

and Ct- = 1. Since 9 > 1/2, by simple calculations, one finds that 7 = (i_e)(^ 2 / 3 +i) ^ ~ 213+1 • 

Using the comparison theorem from [44], we have

\[
\mathcal{R}(\mathrm{sign}(f_T)) - \mathcal{R}(f_c) \le \mathcal{E}(f_T) - \mathcal{E}(f_\rho^V).
\]


So the desired probabilistic upper bound (3.11) for the hinge loss follows from the above inequality and 
Theorem 3.5. 

It remains to prove the second part of the theorem. Since 0 < e < |, the restriction (3 > for the 
approximation order tells us that the index 


a = 


P 1 ^1 

2/3 + 1 2 + 1//3 - 3 


The proof of Theorem 3.10 is complete. 


□ 


Proof of Theorem 3.11. Since 0 < e < |, the restriction /3 > for the approximation order tells us 
that the parameter 9 satisfies i < 0 < 1 and the index 


7 = 


1 _ 2 ^ _ 
(l-0)(2/3 + l) ■ 3 


Finally we find that the index 


Q = 


p 


1 1 e 

179 “ 3 ~ 4' 


2/3+1 2 

So the desired probabilistic upper bound follows from the first conclusion of Theorem 3.10. The proof of 
Theorem 3.11 is complete. □ 
6 Conclusions


This paper proposes and studies iterative regularization approaches for learning with convex loss functions.
More precisely, we study how regularization can be achieved by early stopping an empirical iteration
induced by the subgradient method, or gradient descent in the case the loss is also smooth. Finite sample
bounds are established, providing indications on how to suitably choose the step-size and the stopping
rule. Differently from classical results on the subgradient method, we analyze the behavior of the last iterate,
showing it has essentially the same properties as the average, or the best, iterate. These results provide
a theoretical foundation for early stopping with convex losses.

Beyond the analysis in the paper, our error decomposition provides an approach to incorporate statistical
and optimization aspects in the analysis of learning algorithms. While a natural development will
be to sharpen the bounds and perform extensive empirical tests, we hope the study in the paper can help
in deriving novel and faster algorithms, for example by analyzing accelerations [24] or distributed approaches
within the framework we propose.